read: NVS with Pose-conditioned Diffusion Models

Novel View Synthesis with Diffusion Models (ICLR 2023)

Code (PyTorch) | arXiv | OpenReview

Q&A

  1. Condition image vs target image?

Abstract

  1. An image-to-image diffusion model is conditioned on poses and a single source view to generate multiple views.

  2. Stochastic conditioning: at each denoising step during sampling, a view is randomly selected from the available views as the condition image, rather than always using the given view.

  3. Reconstruct a NeRF from the generated views to measure their 3D consistency.

    • NeRF is not their ultimate objective.

Intro

  1. Regression-based methods for NVS from sparse views built on NeRF are still not generalizable enough, nor can they produce high-quality completions of occluded parts.

    • Regularized NeRFs (RegNeRF) suffer from artifacts when only a few views are given, because they do not exploit priors shared across multiple scenes.

    • Regressing a NeRF from image features (PixelNeRF) tends to produce blurry images.

  2. Geometry-free methods for NVS obtain colors that aren’t directly derived from volume rendering.

    • Light field network
    • Scene Representation Transformer
    • EG3D combines StyleGAN and volume rendering
  3. Their 3D diffusion model (3DiM) is a generative, geometry-free method.

    • Use pairs of images of the same scene to train a diffusion model.
    • During training, one of them serves as the target, and the other is the condition image.
    • The trained model can produce a multi-view set of a scene given one condition image.

Model

To enhance multi-view consistency, they treat the multiple views of a scene not as independent, but as samples that should follow a common, scene-specific distribution.

  • Given a scene with a complete observation set $S$, the distributions of different views, $p(\mathbf{x} \mid S)$, are conditionally independent.

  • NeRF solves NVS under an even stricter assumption: each ray in the scene is conditionally independent.

  • However, because of this, a diffusion model conditioned on a single source view cannot guarantee that samples (generated images) conditioned on different source views follow a common distribution; i.e., the diffusion model needs a single, shared distribution to learn.

    Ideally, the common distribution should condition on the full scene $S$, but it is difficult to approximate the entire scene from sparse views. (Not sure, my guess.)

    That is why they reuse previously generated views as conditions for later samples.
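
As a worked restatement of this argument (my own notation): given the full observation set $S$, conditional independence means

$$ p(\mathbf{x}_1, \ldots, \mathbf{x}_m \mid S) = \prod_i p(\mathbf{x}_i \mid S), $$

but with only one observed view $\mathbf{x}_1$, independent draws from $p(\mathbf{x} \mid \mathbf{x}_1)$ need not be mutually consistent. The chain-rule factorization $p(\mathbf{x}_2, \ldots, \mathbf{x}_m \mid \mathbf{x}_1) = \prod_i p(\mathbf{x}_i \mid \mathbf{x}_{<i}, \mathbf{x}_1)$ ties each new view to all earlier ones, which is what reusing the generated views approximates.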

Pose-conditioned

Given the data distribution $p(\mathbf{x}_1, \mathbf{x}_2)$, the diffusion model learns the distribution of one of the two images conditioned on the other image and both poses.

  1. Noise schedule parameterized by the signal-to-noise ratio $\lambda$.
  2. Loss function of DDPM (a sketch follows below).
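
A sketch of that objective (my reconstruction from the standard DDPM $\epsilon$-prediction loss; exactly how the two poses enter $\epsilon_\theta$ is an assumption on my part):

$$ L(\theta) = \mathbb{E}_{(\mathbf{x}_1, \mathbf{x}_2),\, \lambda,\, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left\| \epsilon_\theta\!\left(\mathbf{z}_2^{(\lambda)}, \mathbf{x}_1\right) - \epsilon \right\|_2^2, \qquad \mathbf{z}_2^{(\lambda)} = \sqrt{\sigma(\lambda)}\, \mathbf{x}_2 + \sqrt{\sigma(-\lambda)}\, \epsilon $$

where the $\sqrt{\sigma(\lambda)}$ / $\sqrt{\sigma(-\lambda)}$ factors are chosen to match the sampler equation below.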

Stochastic conditioning

Figure 3: Stochastic conditioning sampler

  1. A Markovian model, where the next image is conditioned on the $k$ most recently generated views, did not perform well. Ideally, a scene would be represented as $p(\mathbf{X}) = \prod_i p(\mathbf{x}_i \mid \mathbf{x}_{<i})$.

    • Using all previously sampled views is infeasible due to limited memory.

    • They found $k=2$ can achieve 3D consistency, but conditioning on more previous states impairs sampling quality.

    • Instead of conditioning on the last few samples as in the Markovian model, the condition view is stochastically re-selected at each denoising step.

  2. Generating a new view $\mathbf{x}_{k+1}$ takes 256 denoising steps, where at each step the condition image $\mathbf{x}_i$ is randomly chosen from the current view set $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$ (see the code sketch after this list).

    • In each denoising step, the predicted noise is removed from the noisy image $\mathbf{z}_{k+1}^{(\lambda_t)}$ to obtain the denoised estimate $\hat{\mathbf{x}}_{k+1}$; the next noisy image $\mathbf{z}_{k+1}^{(\lambda_{t-1})}$ then follows the forward noising distribution $q$, given the current noisy image $\mathbf{z}_{k+1}^{(\lambda_t)}$ and the denoised estimate $\hat{\mathbf{x}}_{k+1}$.

      $$ \hat{\mathbf{x}}_{k+1} = \frac{1}{\sqrt{\sigma(\lambda_t)}} \left( \mathbf{z}_{k+1}^{(\lambda_t)} - \sqrt{\sigma(-\lambda_t)}\, \epsilon_\theta\!\left(\mathbf{z}_{k+1}^{(\lambda_t)}, \mathbf{x}_i\right) \right) \\ \ \\ \mathbf{z}_{k+1}^{(\lambda_{t-1})} \sim q\!\left(\mathbf{z}_{k+1}^{(\lambda_{t-1})};\ \mathbf{z}_{k+1}^{(\lambda_t)}, \hat{\mathbf{x}}_{k+1}\right) $$

      • (I guess $q$ gets “reversed” here after applying Bayes' rule.)

      • The initial noisy image $\mathbf{z}_{k+1}^{(\lambda_T)}$ is pure Gaussian noise $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

    • After the 256 steps finish, add the resulting image $\mathbf{x}_{k+1}$ to the set $\mathcal{X}$.

    • The number of steps (256) can be increased so that every existing view gets a chance to serve as the condition.

    • This scheme approximates true autoregressive sampling.

      • An autoregressive model always uses all previous states to predict the next state, unlike a Markov chain, which considers only a limited number of recent outputs.

        Therefore, training a true autoregressive model requires sequences, i.e., multi-view training data here.

        “True autoregressive sampling needs a score model of the form $\log q(\mathbf{z}_{k+1}^{(\lambda)} \mid \mathbf{x}_1, \ldots, \mathbf{x}_k)$ and multi-view training data.”

      • But they are not interested in requiring multiple source views here.
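
Putting the sampler together, here is a minimal PyTorch-style sketch. It is my reconstruction, not the authors' code: `eps_model`'s signature, the pose handling, and the schedule layout are assumptions, and I re-noise with the forward marginal rather than the exact posterior $q$.

```python
import torch

@torch.no_grad()
def stochastic_conditioning_sample(eps_model, views, poses, new_pose, lambdas):
    """Sample one new view x_{k+1} given the growing view set X = views.

    Assumptions (mine): eps_model(z, x_cond, pose_cond, pose_target, lam)
    predicts the noise; `lambdas` is a 1-D tensor of log-SNRs with the
    noisiest level at the last index; sigma is the sigmoid.
    """
    sigma = torch.sigmoid
    z = torch.randn_like(views[0])              # z^{(lambda_T)} ~ N(0, I)
    for t in reversed(range(1, len(lambdas))):  # 256 steps in the paper
        lam, lam_prev = lambdas[t], lambdas[t - 1]
        # Stochastic conditioning: re-draw the condition view at EVERY step.
        i = torch.randint(len(views), (1,)).item()
        eps = eps_model(z, views[i], poses[i], new_pose, lam)
        # Denoised estimate x_hat, as in the update equation above.
        x_hat = (z - sigma(-lam).sqrt() * eps) / sigma(lam).sqrt()
        x_hat = x_hat.clamp(-1.0, 1.0)
        # Simplification: re-noise x_hat via the forward marginal instead of
        # the exact posterior q(z^{(lam_prev)}; z^{(lam)}, x_hat).
        z = sigma(lam_prev).sqrt() * x_hat + sigma(-lam_prev).sqrt() * torch.randn_like(z)
    return x_hat  # append to X before sampling the next pose
```

As the note above says, running more denoising steps than there are views gives every view in $\mathcal{X}$ a chance to act as the condition at least once.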

X-UNet

Figure 4: X-UNet Architecture

A UNet with only self-attention fails to generate multi-view-consistent images given limited training data.

  1. Each frame has its own noise level.
  2. The positional encoding of the pose has the same spatial size as the feature maps.
  3. Cross-attention lets the two frames attend to each other.

Inputs:

The two frames seem to be stacked/concatenated so that the weights of the Conv2d and self-attention layers are shared between the noisy image and the condition image (see the toy example below).
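
A toy illustration of this weight sharing (my own code, not the actual X-UNet; channel sizes, layer choices, and the token layout are made up):

```python
import torch
import torch.nn as nn

class TwoFrameBlock(nn.Module):
    """Toy block: shared Conv2d + self-attention per frame, then cross-attention."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        # One set of weights, applied to BOTH frames (shared parameters).
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def _tokens(self, x):
        # [B, C, H, W] -> [B, H*W, C] token sequence for attention.
        return x.flatten(2).transpose(1, 2)

    def forward(self, noisy, cond):
        # Stack frames along the batch dim so the SAME kernel processes
        # both the noisy frame and the condition frame.
        b = noisy.shape[0]
        both = self.conv(torch.cat([noisy, cond], dim=0))
        noisy, cond = both[:b], both[b:]
        t_n, t_c = self._tokens(noisy), self._tokens(cond)
        # Shared self-attention within each frame.
        t_n = self.self_attn(t_n, t_n, t_n)[0]
        t_c = self.self_attn(t_c, t_c, t_c)[0]
        # Cross-attention: each frame attends to the other.
        out_n = self.cross_attn(t_n, t_c, t_c)[0]
        out_c = self.cross_attn(t_c, t_n, t_n)[0]
        return out_n, out_c
```

Stacking along the batch dimension is one simple way to guarantee the two frames pass through identical Conv2d/self-attention weights; the cross-attention at the end is where the frames exchange information.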

Experiments

Dataset: SRN ShapeNet (synthetic cars and chairs) github

| file | size |
| --- | --- |
| cars_train.zip | 3.26 GB |
| chairs_train.zip | 60.3 GB |

Use Instant-NGP (without its view-dependent modules) for the NeRF that measures 3D consistency; a sketch follows below.
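
For context, a hedged sketch of how such a 3D-consistency check could be wired up (entirely my guess at the protocol; `fit_field`, `render`, and `psnr` are hypothetical stand-ins, not the real Instant-NGP API):

```python
def consistency_score(views, poses, fit_field, render, psnr):
    """Fit a view-independent field on half of the generated views,
    then report PSNR on the held-out half (hypothetical helpers)."""
    train_ids = [i for i in range(len(views)) if i % 2 == 0]
    test_ids = [i for i in range(len(views)) if i % 2 == 1]
    field = fit_field([views[i] for i in train_ids],
                      [poses[i] for i in train_ids])
    scores = [psnr(render(field, poses[i]), views[i]) for i in test_ids]
    return sum(scores) / len(scores)  # higher = more 3D-consistent views
```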