Code | Arxiv (2307) | ProjPage
Abs & Intro
- Fine-tune the pre-trained text-to-image diffusion model (SD)
- Insert cross-attention blocks between UNet blocks;
- Generate multiple views in parallel using SD, and fuse the views by attention;
- Freeze pre-trained weights while training the attention blocks
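A minimal sketch of this freeze-and-train setup, assuming a PyTorch SD UNet and a hypothetical `caa_blocks` module list holding the inserted attention blocks:

```python
import torch.nn as nn

def set_trainable(unet: nn.Module, caa_blocks: nn.ModuleList):
    """Freeze the pre-trained SD UNet; train only the inserted attention blocks."""
    for p in unet.parameters():
        p.requires_grad_(False)   # keep SD's pre-trained generation capability intact
    for p in caa_blocks.parameters():
        p.requires_grad_(True)    # only the new blocks receive gradients
    return list(caa_blocks.parameters())   # parameters to hand to the optimizer
```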
Problems solved:
- Generating panoramas
- Extrapolating one perspective image to a full 360-degree view
Preliminary
MVDiffusion derives from LDM (Latent Diffusion Model¹), which contains 3 modules:
- VAE for transferring the generation process to a latent space,
- denoising model (UNet) for sampling from the distribution of the inputs’ latent codes,
- condition encoder for providing descriptors.
The loss function is similar to the original diffusion model's: the MSE between the original noise and the predicted noise, where the prediction is conditioned on the noisy latents, the timestep, and the condition features.
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(Z_t, t, \tau_\theta(y)) \right\rVert_2^2\right]$$
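A hedged PyTorch sketch of this objective; `vae`, `unet`, `cond_encoder`, and `alphas_cumprod` are placeholder names, not the paper's code:

```python
import torch
import torch.nn.functional as F

def ldm_loss(vae, unet, cond_encoder, x, y, alphas_cumprod):
    """One LDM training step: predict the noise added to the latent code (equation above)."""
    with torch.no_grad():
        z0 = vae.encode(x)                        # clean latent code E(x)
    noise = torch.randn_like(z0)                  # epsilon ~ N(0, I)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)       # cumulative alpha_bar_t per sample
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise   # noisy latent Z_t
    cond = cond_encoder(y)                        # condition features tau_theta(y)
    eps_pred = unet(zt, t, cond)                  # predicted noise
    return F.mse_loss(eps_pred, noise)            # || eps - eps_theta(Z_t, t, tau_theta(y)) ||^2
```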
Convolution layers are inserted into each UNet block:
- The resulting feature maps at each level are added to the corresponding UNet blocks.
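A rough sketch of such a conditioning path; the channel sizes, strides, and additive injection are assumptions for illustration, not the paper's exact design:

```python
import torch.nn as nn

class CondEncoder(nn.Module):
    """Small conv pyramid whose per-level feature maps are added to the UNet block features."""
    def __init__(self, in_ch=1, channels=(320, 640, 1280)):
        super().__init__()
        stages = []
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, cond):                      # e.g. a single-channel depth map
        feats, h = [], cond
        for stage in self.stages:
            h = stage(h)
            feats.append(h)                       # one feature map per UNet resolution level
        return feats                              # added to the corresponding UNet activations
```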
Figure 2
Pipeline
Pixel-to-pixel correspondences enable multi-view-consistent generation for both panorama and depth2img: a homography matrix (panorama) or a projection matrix (depth2img) determines the matched pixel pairs between two images.
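A small NumPy sketch of both correspondence types; the rotation/pose conventions (world-to-camera) and the shared intrinsics `K` are assumptions:

```python
import numpy as np

def warp_by_homography(p_src, K, R_src, R_tgt):
    """Panorama case: rotation-only cameras, so H = K R_tgt R_src^T K^{-1} maps pixels directly."""
    H = K @ R_tgt @ R_src.T @ np.linalg.inv(K)
    q = H @ np.array([p_src[0], p_src[1], 1.0])
    return q[:2] / q[2]                            # matched pixel in the target view

def warp_by_depth(p_src, depth, K, T_src2tgt):
    """depth2img case: unproject with the depth value, then project with the target camera."""
    x = depth * (np.linalg.inv(K) @ np.array([p_src[0], p_src[1], 1.0]))  # 3D point, source frame
    x_tgt = T_src2tgt[:3, :3] @ x + T_src2tgt[:3, 3]                      # move to target frame
    q = K @ x_tgt
    return q[:2] / q[2]                            # matched pixel pair (source, target)
```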
Panorama
Text-conditioned model: generates 8 target images from noise, conditioned on per-view text prompts.
- The final linear layer in each CAA block is initialized to zero to avoid disrupting SD's original capability.
- The UNet predicts multiple latents from the noise images, which are then decoded back to images by the pre-trained VAE decoder.
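A rough sampling-loop sketch for this model; the `scheduler` follows the diffusers convention, and the UNet call signature is a placeholder (the CAA fusion happens inside the shared UNet):

```python
import torch

@torch.no_grad()
def sample_panorama(unet, vae, scheduler, text_embeds, n_views=8, latent_shape=(4, 64, 64)):
    """Denoise 8 view latents in parallel with one shared UNet; CAA blocks exchange features."""
    latents = torch.randn(n_views, *latent_shape)             # one noise latent per view
    for t in scheduler.timesteps:
        eps = unet(latents, t, text_embeds)                    # per-view text prompts as condition
        latents = scheduler.step(eps, t, latents).prev_sample  # one denoising step for all views
    return vae.decode(latents)                                 # 8 consistent perspective images
```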
Image&Text-conditioned model: generates 7 target images based on 1 condition image and the respective text prompts.
- Built on SD's inpainting model, since it takes 1 condition image.
- The condition image is not inpainted: its noise input is concatenated with an all-one mask (4 channels in total).
- The target images are completely regenerated: their noise inputs are concatenated with an all-zero mask (4 channels in total).
- Concatenating the all-one and all-zero masks during training makes the model learn to apply different processes to the condition image and the target images.
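A hedged sketch of how these inpainting-style inputs could be assembled; the channel layout follows the note and the real MVDiffusion code may differ:

```python
import torch

def build_inpaint_inputs(noisy_latents, cond_latent):
    """View 0 is the condition image (all-one mask, kept); views 1..7 are targets (all-zero mask)."""
    n, c, h, w = noisy_latents.shape                # n = 1 condition + 7 target views
    masks = torch.zeros(n, 1, h, w)                 # all-zero: fully regenerate the target views
    masks[0] = 1.0                                  # all-one: do not repaint the condition view
    masked = torch.zeros_like(noisy_latents)        # masked-image latents are empty for targets
    masked[0] = cond_latent                         # the condition image latent stays visible
    return torch.cat([noisy_latents, masks, masked], dim=1)   # concatenated UNet input
```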
depth2img
This two-stage design is needed because SD's inpainting model doesn't support depth-map conditioning:
- So the Text-conditioned model is reused to generate condition images.
- Then the Image&Text-conditioned model interpolates between the two condition images.
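As a rough call sequence (all function names are placeholders), the two stages look like this:

```python
def depth2img_two_stage(text_model, img_text_model, depths, prompts, key_idx):
    """Stage 1: generate condition (key) images; Stage 2: fill in the views between them."""
    key_images = text_model(depths[key_idx], prompts[key_idx])   # reuse the Text-conditioned model
    return img_text_model(key_images, depths, prompts)           # interpolate the remaining views
```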
Correspondence-aware attention
For each source pixel, aggregate the features of the K×K neighboring pixels around its corresponding pixel in every target feature map into that source pixel's feature vector.
- The source pixel $s$ uses positional encoding $\gamma(0)$.
- Each neighbor pixel $t_*^l$ around the corresponding pixel $t^l$ in target image $l$ uses positional encoding $\gamma(s_*^l - s)$: the neighbor $t_*^l$ is warped back to the source feature map as $s_*^l$, and its displacement from the source pixel $s$ is encoded.
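A simplified per-pixel sketch of the CAA computation; `gamma` (a positional encoding mapping a 2-D displacement to a c-dim vector), the linear projections `Wq/Wk/Wv`, and the precomputed warped displacements are all assumptions based on the description above:

```python
import torch
import torch.nn.functional as F

def caa_for_one_source_pixel(src_feat, nbr_feats, nbr_disp, gamma, Wq, Wk, Wv):
    """
    src_feat:  (c,)        feature of the source pixel s
    nbr_feats: (L, K*K, c) features of the K x K neighbors t_*^l in each of L target views
    nbr_disp:  (L, K*K, 2) displacements s_*^l - s (neighbors warped back to the source map)
    """
    q = Wq(src_feat + gamma(torch.zeros(2)))       # query uses gamma(0): zero displacement
    k = Wk(nbr_feats + gamma(nbr_disp))            # keys/values carry gamma(s_*^l - s)
    v = Wv(nbr_feats + gamma(nbr_disp))
    k, v = k.flatten(0, 1), v.flatten(0, 1)        # attend over all L * K * K neighbors at once
    attn = F.softmax(k @ q / src_feat.shape[-1] ** 0.5, dim=0)
    return attn @ v                                # aggregated feature for the source pixel s
```

In the actual model this runs for every pixel of every view in parallel, and the result is fed back through the zero-initialized linear layer mentioned above.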
Figure 3