StyleMVD: Tuning-Free Image-Guided Texture Stylization by Synchronized Multi-View Diffusion

RebuilderAI, KAIST

StyleMVD, our mesh texture stylization method, enables high-quality texture style transfer from an input style image while preserving the original content of the mesh's texture, using a pretrained diffusion model.

Abstract

In this paper, we propose StyleMVD, a high-quality texture generation framework for textured 3D meshes guided by a reference style image. Style transfer in images has been extensively researched, but style transfer on 3D meshes remains relatively unexplored. Unlike in images, the key challenge in 3D lies in generating a consistent style across views. While existing methods generate mesh textures using pretrained text-to-image (T2I) diffusion models, accurately expressing the style of an image in text remains challenging. To overcome this, we propose the StyleMVD module, which transfers a consistent style across different views from the reference style image without additional fine-tuning. Specifically, StyleMVD converts the existing self-attention into sample-wise self-attention to achieve global coherence, and decouples the cross-attention in the diffusion model to inject style features from the reference style image. Our experiments show that StyleMVD achieves impressive results in both consistent image-to-texture style transfer and texture quality, at high speed.
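The sample-wise self-attention mentioned above is only described at a high level here, so the following is a minimal sketch of one way such an operation can be realized: each view's queries attend to keys and values gathered from all views of the same sample, which is what gives cross-view coherence. The function and argument names (e.g. `sample_wise_self_attention`, `num_views`) are ours for illustration, not identifiers from the released code, and the single-head formulation is a simplification.

```python
import torch
import torch.nn.functional as F

def sample_wise_self_attention(x, to_q, to_k, to_v, num_views):
    """Self-attention in which every view attends to the tokens of all views.

    x: (batch * num_views, tokens, dim) latent features; views of the same
       sample are assumed to be contiguous along the batch dimension.
    to_q / to_k / to_v: the usual linear projections of the attention layer.
    """
    b, t, d = x.shape
    q = to_q(x)  # queries stay per-view

    # Concatenate the tokens of all views of each sample so that keys and
    # values span every viewpoint, not just the current one.
    kv = x.reshape(b // num_views, num_views * t, d)
    k = to_k(kv).repeat_interleave(num_views, dim=0)
    v = to_v(kv).repeat_interleave(num_views, dim=0)

    # (batch * num_views, tokens, dim) output, now globally coherent.
    return F.scaled_dot_product_attention(q, k, v)
```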


Method



Overview of the StyleMVD pipeline. We first render the input textured mesh to generate condition images and view prompts based on the camera poses. Text embeddings are then extracted from the view prompts, while an image embedding is obtained from the input style image. Each view-dependent text embedding is concatenated with the image embedding and forwarded into StyleMVD together with the condition images. After the view-dependent denoising process, the stylized images are unprojected onto the input mesh.
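As a rough illustration of the pipeline in the caption above, the sketch below lays out the stages in order. Every name here (`render_view`, `make_view_prompt`, `encode_text`, `encode_image`, `style_mvd.denoise`, `unproject_to_texture`) is a hypothetical placeholder for the corresponding stage, not an API from the released code.

```python
import torch

def stylize_mesh(mesh, style_image, camera_poses, style_mvd):
    # 1. Render condition images and build view prompts from the camera poses.
    condition_images = [render_view(mesh, pose) for pose in camera_poses]
    view_prompts = [make_view_prompt(pose) for pose in camera_poses]

    # 2. Text embeddings from the view prompts, image embedding from the style image.
    text_embs = [encode_text(prompt) for prompt in view_prompts]
    style_emb = encode_image(style_image)

    # 3. Concatenate each view-dependent text embedding with the image embedding
    #    and run the synchronized multi-view denoising.
    cond_embs = [torch.cat([text_emb, style_emb], dim=1) for text_emb in text_embs]
    stylized_views = style_mvd.denoise(condition_images, cond_embs)

    # 4. Unproject the stylized images back onto the mesh texture.
    return unproject_to_texture(mesh, stylized_views, camera_poses)
```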



Illustration of the U-Net architectures in (a) the Latent Diffusion Model (LDM) and (b) StyleMVD. The basic blocks of the U-Net in LDM comprise a residual block, self-attention, and cross-attention, and were initially designed to accept only text embeddings. In StyleMVD, we modify the self-attention into sample-wise self-attention and the cross-attention into decoupled cross-attention to enable 3D mesh texture stylization from the reference image.
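The decoupled cross-attention is shown above only at the block-diagram level. The snippet below is a sketch of one common way to decouple it, assuming separate key/value projections for the text tokens and the style-image tokens whose outputs are summed, with an image-branch weight `scale`; the class and parameter names are ours, and the single-head formulation is again a simplification.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DecoupledCrossAttention(nn.Module):
    """Cross-attention with separate key/value paths for text and style-image features."""

    def __init__(self, dim, text_dim, image_dim, scale=1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        # Original text path, as in the LDM U-Net.
        self.to_k_text = nn.Linear(text_dim, dim)
        self.to_v_text = nn.Linear(text_dim, dim)
        # Additional path that injects features of the reference style image.
        self.to_k_image = nn.Linear(image_dim, dim)
        self.to_v_image = nn.Linear(image_dim, dim)
        self.scale = scale

    def forward(self, x, text_emb, image_emb):
        q = self.to_q(x)
        # Attend to the text tokens and to the style-image tokens separately,
        # then combine the two results, weighting the image branch by `scale`.
        text_out = F.scaled_dot_product_attention(
            q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        image_out = F.scaled_dot_product_attention(
            q, self.to_k_image(image_emb), self.to_v_image(image_emb))
        return text_out + self.scale * image_out
```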





Comparisons to other methods

[Figure: comparison results]




Citation

Please consider citing our work if you find it useful.


      @article{kim2024stylemvd,
        title = {{StyleMVD: Tuning-Free Image-Guided Texture Stylization by Synchronized Multi-View Diffusion}},
        author = {Kim, Kunho and An, Sanghyeon and Sung, Minhyuk},
        journal = {arXiv preprint arXiv:2109.00000},
        year = {2024},
      }