read: MVDiffusion generates multi-view images

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion


Code | Arxiv (2307) | ProjPage

Abs & Intro

  1. Fine-tune the pre-trained text-to-image diffusion model (SD).
  2. Insert cross-attention blocks between UNet blocks.
  3. Generate multiple views in parallel with SD and fuse the views by attention.
  4. Freeze the pre-trained weights and train only the attention blocks (see the sketch below).
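
A minimal PyTorch sketch of this recipe (placeholder names such as `CrossViewAttention` and `unet_block`, not the authors' code): views are folded into the batch dimension so one SD UNet denoises them in parallel, a new attention block mixes tokens across views, and only that new block is trained while the pretrained weights stay frozen. For brevity this uses plain full attention across views; MVDiffusion restricts it to corresponding pixels (see the CAA section below).

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Toy stand-in for the inserted attention blocks."""
    def __init__(self, dim, views, heads=4):
        super().__init__()
        self.views = views
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                                   # feats: (B*V, T, C)
        bv, t, c = feats.shape
        x = feats.reshape(bv // self.views, self.views * t, c)  # group each scene's V views
        fused, _ = self.attn(x, x, x)                           # tokens attend across views
        return feats + fused.reshape(bv, t, c)                  # residual connection

dim, views = 64, 8
unet_block = nn.Linear(dim, dim)                 # stand-in for one pretrained UNet block
cross_view = CrossViewAttention(dim, views)

for p in unet_block.parameters():                # step 4: pretrained SD weights stay frozen
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(cross_view.parameters(), lr=1e-4)  # train only the new blocks

latents = torch.randn(2 * views, 32, dim)        # 2 scenes x 8 views, 32 tokens each
out = cross_view(unet_block(latents))            # denoise in parallel, then fuse across views
```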

Problems solved:

  1. Generating panoramas
  2. Extrapolating one perspective image to a full 360-degree view

Preliminary

MVDiffusion derives from LDM (Latent Diffusion Model¹), which contains 3 modules:

  1. a VAE that moves the generation process into a latent space,
  2. a denoising model (UNet) that samples from the distribution of the inputs’ latent codes,
  3. a condition encoder that provides condition descriptors.

The loss function is similar to the original diffusion model’s: the MSE between the original noise and the predicted noise, where the prediction is conditioned on the noisy latent, the timestep, and the condition features.

$$L_{LDM} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, \tau(y)\right) \right\rVert_2^2\right]$$
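
As a concrete reading of the formula, here is a minimal sketch of that objective; `eps_model`, `alphas_cumprod`, and `cond` are placeholders for the UNet, the noise schedule, and the condition features.

```python
import torch
import torch.nn.functional as F

def ldm_loss(eps_model, z0, cond, alphas_cumprod):
    """MSE between sampled noise and the noise predicted from (z_t, t, cond)."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)   # random timesteps
    eps = torch.randn_like(z0)                                              # "original" noise
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps                    # noisy latents
    return F.mse_loss(eps_model(z_t, t, cond), eps)

# toy usage: an identity "model" and a linear noise schedule
loss = ldm_loss(lambda z, t, c: z, torch.randn(2, 4, 8, 8), None,
                torch.linspace(0.999, 0.001, 1000))
```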

Convolution layers are inserted into each UNet block:

  • The resulting feature maps at each level are added into the corresponding UNet blocks.
Figure 2

Pipeline

Pixel-to-pixel correspondences enable multi-view-consistent generation for both panorama and depth2img, because a homography matrix (panorama) or a depth-based projection matrix (depth2img) determines the matched pixel pairs between two images; a sketch of the panorama case follows.
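
A minimal NumPy sketch of the panorama case, assuming pinhole intrinsics `K` and views that share one camera center (pure rotation between views); `rotation_y` and `correspondence` are hypothetical helpers, and the 90-degree FOV / 45-degree spacing below is just an example matching 8 views around a circle.

```python
import numpy as np

def rotation_y(deg):
    """Rotation about the vertical axis (camera-to-world)."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def correspondence(px, K, R_i, R_j):
    """Map homogeneous pixel `px` = (u, v, 1) from view i to view j.

    R_i, R_j are camera-to-world rotations; since all views share one camera
    center, the mapping is the homography H = K R_j^T R_i K^-1.
    """
    H = K @ R_j.T @ R_i @ np.linalg.inv(K)
    q = H @ px
    return q[:2] / q[2]

# e.g. 512x512 views with a 90-degree FOV, neighboring views 45 degrees apart
f = 256.0 / np.tan(np.deg2rad(45.0))
K = np.array([[f, 0.0, 256.0], [0.0, f, 256.0], [0.0, 0.0, 1.0]])
print(correspondence(np.array([400.0, 256.0, 1.0]), K, rotation_y(0.0), rotation_y(45.0)))
```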

Panorama

Text-conditioned model: generates 8 target images from noise, conditioned on per-view text prompts.

  • The final linear layer in each CAA block is initialized to zero to avoid disrupting SD’s original capability (see the sketch after this list).

  • Multiple latents are predicted in parallel by the UNet from the noise images and then decoded back to images by the pre-trained VAE’s decoder.
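
A minimal sketch of that zero-initialization trick (placeholder name `CAAOutputProj`, not the authors' code): with the output projection zeroed, the CAA block is an identity mapping at the start of training, so the frozen SD branch initially behaves exactly like the original model.

```python
import torch
import torch.nn as nn

class CAAOutputProj(nn.Module):
    """Residual output projection of a CAA block, zero-initialized."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)       # zero weights ...
        nn.init.zeros_(self.proj.bias)         # ... and zero bias

    def forward(self, x, attn_out):
        return x + self.proj(attn_out)         # exactly x at initialization

blk = CAAOutputProj(64)
x = torch.randn(2, 32, 64)
assert torch.allclose(blk(x, torch.randn_like(x)), x)   # SD's features pass through untouched
```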

Image&Text-conditioned model: generates 7 target images from 1 condition image and the respective text prompts.

  • Based on SD’s inpainting model, since it takes 1 condition image.

    1. The condition image is not inpainted: its noise input is concatenated with an all-one mask (4 channels in total).

    2. The target images are completely regenerated: their noise inputs are concatenated with an all-zero mask (4 channels in total).

    3. Concatenating the all-one and all-zero masks during training makes the model learn to apply different processes to the condition image and the target images (a layout sketch follows this list).

depth2img

(Diagram: depth maps and per-view text prompts are fed to the Text-conditioned model to generate key-frame images; the Image&Text-conditioned model then generates the in-between images from those key frames.)

This two-stage design exists because SD’s inpainting model does not support a depth-map condition.

  • So the Text-conditioned model is reused to generate the condition images (the key frames).
  • Then the Image&Text-conditioned model interpolates between two neighboring condition images (a sketch of this schedule follows).
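
A minimal sketch of that two-stage schedule; `generate_sequence`, `text_cond_model`, and `img_text_cond_model` are placeholder callables, not the authors' API. Key views are generated first, then the in-between views are filled in conditioned on their neighboring key views.

```python
def generate_sequence(depths, prompts, key_stride, text_cond_model, img_text_cond_model):
    """Stage 1: generate key views; stage 2: fill in views between neighboring keys."""
    keys = list(range(0, len(depths), key_stride))
    views = {i: text_cond_model(depths[i], prompts[i]) for i in keys}        # stage 1
    for a, b in zip(keys, keys[1:]):
        for i in range(a + 1, b):                                            # stage 2
            views[i] = img_text_cond_model(views[a], views[b], depths[i], prompts[i])
    return [views[i] for i in sorted(views)]

# toy usage with placeholder "models"
frames = generate_sequence([None] * 7, ["a bedroom"] * 7, 3,
                           lambda d, p: f"key[{p}]",
                           lambda a, b, d, p: f"interp[{a} | {b}]")
```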

Correspondence-aware attention

For each pixel, aggregate the features of the KxK neighbor pixels around its corresponding pixel in every target feature map into that pixel’s own feature vector.

  • The source pixel $s$ uses the positional encoding $γ(0)$ (zero displacement from itself).

  • A neighbor pixel $t_\*^l$ around the corresponding pixel $t^l$ on target image $l$ uses the positional encoding $γ(s_\*^l-s)$, i.e., the neighbor pixel $t_\*^l$ is warped back to the source feature map (as $s_\*^l$) and encoded by its displacement from the source pixel $s$ (see the sketch after Figure 3).

    Figure 3
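
A minimal single-pixel sketch of correspondence-aware attention (not the authors' implementation): the source pixel’s query is tagged with $γ(0)$, and each of the KxK neighbors around the corresponding target pixel is tagged with $γ(s_\*^l-s)$ computed from its warped-back position. `gamma`, `pe_proj`, and `caa_one_pixel` are placeholder names, and the correspondence/warping lookup that produces the displacements is assumed to be done elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, NUM_FREQS = 64, 4
pe_proj = nn.Linear(4 * NUM_FREQS, C)                    # lift gamma(.) to the feature dimension
w_q, w_k, w_v = (nn.Linear(C, C) for _ in range(3))

def gamma(disp):
    """Fourier positional encoding of a 2-D displacement: (..., 2) -> (..., 4*NUM_FREQS)."""
    freqs = (2.0 ** torch.arange(NUM_FREQS, dtype=torch.float32)) * torch.pi
    x = disp[..., None] * freqs
    return torch.cat([x.sin(), x.cos()], dim=-1).flatten(-2)

def caa_one_pixel(src_feat, neigh_feats, neigh_disps):
    """src_feat: (C,)   neigh_feats: (K*K, C)   neigh_disps: (K*K, 2), each = s_*^l - s."""
    q = w_q(src_feat + pe_proj(gamma(torch.zeros(2))))   # source pixel: gamma(0)
    k = w_k(neigh_feats + pe_proj(gamma(neigh_disps)))   # neighbors: gamma of warped-back offset
    v = w_v(neigh_feats + pe_proj(gamma(neigh_disps)))
    attn = F.softmax((k @ q) / C ** 0.5, dim=-1)         # (K*K,) weights over the neighborhood
    return attn @ v                                      # aggregated feature for this pixel

out = caa_one_pixel(torch.randn(C), torch.randn(9, C), torch.randn(9, 2))   # K = 3
```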


Ref

  1. Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models”, CVPR 2022.
