
read: Render - NVS | Match-NeRF


Explicit Correspondence Matching for Generalizable Neural Radiance Fields

Code | Arxiv

Notes

Idea

Take the differences between the reference views' image features at a 3D point as a geometry prior.

  • (2023-10-25) The method is like a simplified GPNR: feature differences between views are measured by cosine similarity (dot product). However, GPNR is more general, while MatchNeRF only computes differences between each pair of views and averages them into the input feature.

  • (2023-12-09) Image features alone cannot constrain the geometry of unseen scenes well. In contrast, MVSNet presets candidate depth planes explicitly.

    Also, MVSNet uses the variance of features across all feature maps to measure depth misalignment, whereas MatchNeRF uses differences between each pair of feature maps (see the small sketch below).
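
A minimal sketch of this contrast, assuming per-view features sampled at the same 3D point are already given (function names are my own, not from either codebase):

```python
import torch
import torch.nn.functional as F

def mvsnet_style_cue(feats):            # feats: (N_views, C) features of one 3D point
    # MVSNet-style: variance across all views -> a C-dim cost vector
    return feats.var(dim=0, unbiased=False)

def matchnerf_style_cue(feats):
    # MatchNeRF-style: cosine similarity for every pair of views, then average
    feats = F.normalize(feats, dim=-1)
    n = feats.shape[0]
    sims = [feats[i] @ feats[j] for i in range(n) for j in range(i + 1, n)]
    return torch.stack(sims).mean()     # a scalar per point (grouped version in the Pipeline sketch)
```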

Pipeline

  1. A CNN extracts an image feature map at 1/8 resolution, which is upsampled to 1/4 resolution, for each reference view.

  2. Select a pair of reference views and mix their feature maps by cross-attention (GMFlow).

  3. Project a 3D point onto this pair of cross-attended feature maps.

  4. Measure the difference between the two feature vectors by cosine similarity (dot product).

    • The feature difference indicates whether the 3D point lies on a surface, so it provides a geometry prior.
  5. The dot product of two feature vectors is a scalar, which loses much information. So they divide the channels into groups, compute a “group-wise dot product”, and concatenate the per-group dot products into a vector 𝐳.

    Likewise, for the 1/4-resolution feature map, there is a dot-product vector $\hat{\mathbf{z}}$ for each pair of reference views.

  6. Given 𝑁 reference views, there are 𝑁(𝑁-1)/2 pairs of views and thus 𝑁(𝑁-1)/2 “difference” vectors, which are merged into a single 𝐳 by taking their element-wise average (a minimal sketch of steps 4–6 follows this list).

  7. This “feature difference” vector 𝐳 (the geometry prior) is fed, together with the 3D point’s position and view direction, into the decoder (an MLP and a ray transformer), which regresses color and volume density.
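
A minimal sketch of steps 4–6, assuming the per-view features of a 3D point have already been sampled; names such as `group_cosine` are mine and not from the official code:

```python
import torch
import torch.nn.functional as F

def group_cosine(f1, f2, n_groups=8):
    """Group-wise cosine similarity between two C-dim vectors -> (n_groups,)."""
    g1 = F.normalize(f1.view(n_groups, -1), dim=-1)   # split channels into groups
    g2 = F.normalize(f2.view(n_groups, -1), dim=-1)
    return (g1 * g2).sum(dim=-1)                      # one dot product per group

def matching_feature(per_view_feats, n_groups=8):
    """per_view_feats: (N_views, C) features of one 3D point -> averaged vector z of shape (n_groups,)."""
    n = per_view_feats.shape[0]
    pair_z = [group_cosine(per_view_feats[i], per_view_feats[j], n_groups)
              for i in range(n) for j in range(i + 1, n)]   # N(N-1)/2 pairs
    return torch.stack(pair_z).mean(dim=0)                  # element-wise average -> z

# The same computation on the 1/4-resolution features yields z_hat; z (and z_hat),
# the point's position, and the view direction then go to the decoder.
```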


Experiments

Settings follow MVSNeRF.

Datasets:

| Stage | Data      | Contents  | Resolution | N_views |
|-------|-----------|-----------|------------|---------|
| Train | DTU       | 88 scenes | 512x640    | 49      |
| Test  | DTU       | 16 scenes | –          | 3       |
| Test  | NeRF real | 8 scenes  | 640x960    | 4       |
| Test  | Blender   | 8 scenes  | 800x800    | 4       |

  • Device: V100 (16 GB)

Play


(2023-08-24)

Compare with GNT

The architectures of Match-NeRF and GNT are similar.

  • (2023-12-09) Overview: source images’ features are extracted, mixed, and regressed to rgbσ.
  • Match-NeRF is trained only on the DTU dataset, while GNT can be trained on multiple datasets (gnt_full).

  • GNT merges multiple source views via subtraction-based attention, while Match-NeRF fuses the multi-view feature maps before they enter the model.

  • Match-NeRF mixes the entire feature maps for each pair of reference views, and then projects 3D points onto the fused feature maps to index feature vectors.

    However, GNT directly mixes a point’s feature vectors sampled from each feature map.

  • Different training settings:

    | Hyper-params            | GNT  | MatchNeRF |
    |--------------------------|------|-----------|
    | #rays per gradient step  | 2048 | 1024      |
    | #source views            | 8~10 | 3         |
    • A single 1080 Ti only supports --nerf.rand_rays_train=512 for MatchNeRF.

      opts.batch_size is divided evenly across GPUs (self.opts.batch_size // len(self.opts.gpu_ids)), so bs (number of images) = 1 cannot be split over multiple GPUs.

      And with bs=2, each card still has to process 1024 rays sampled from one image (see the small illustration after these bullets).

    • Testing with one 1080 Ti: python test.py --yaml=test --name=matchnerf_3v --nerf.rand_rays_test=10240
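
A small illustration of the ray budget implied by these settings (plain arithmetic, not MatchNeRF code):

```python
batch_size = 2          # opts.batch_size: number of images per iteration
gpu_ids = [0, 1]
rand_rays_train = 1024  # rays sampled per image

images_per_gpu = batch_size // len(gpu_ids)       # 1 image per card; bs=1 cannot be split
rays_per_gpu = images_per_gpu * rand_rays_train   # each card still processes 1024 rays
print(images_per_gpu, rays_per_gpu)               # -> 1 1024
```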


(2023-09-27)

Code Details

  • GMFlow uses 6 transformer blocks, each consisting of self_attn and cross_attn, to fuse windows; the odd blocks (1st, 3rd, 5th) perform a window shift.

  • MatchNeRF fully fine-tunes the pre-trained GMFlow.

  • Does the inner product of a pair of features come from GMFlow?

    • (2023-10-25) I guess it is a simplified attention over pairs only, instead of over all views.

    • (2023-12-09) Inferring geometry from differences of high-dimensional features may date back even earlier than MVSNet.

  • Self-attention and cross-attention for two samples data1 and data2 can be done in a single GMFlow transformer block by concatenating the two samples along the batch dimension twice in different orders, i.e., source=[‘data1’,‘data2’] and target=[‘data2’,‘data1’].

    Self-attention is then performed between source and source, and cross-attention between source and target. When the fused source is returned after a block, its batch order has to be reversed again to form the new target (a minimal sketch follows).
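
A minimal sketch of this batch-concatenation trick (my own rewrite with plain nn.MultiheadAttention, not GMFlow’s exact block):

```python
import torch
import torch.nn as nn

self_attn  = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

data1 = torch.randn(1, 196, 128)   # tokens of view 1
data2 = torch.randn(1, 196, 128)   # tokens of view 2

source = torch.cat([data1, data2], dim=0)   # ['data1', 'data2']
target = torch.cat([data2, data1], dim=0)   # ['data2', 'data1']

# self-attn: each sample in the batch attends to itself
source, _ = self_attn(source, source, source)
# the fused source comes back in the order ['data1', 'data2'];
# reverse the batch order again to form the new target
target = torch.cat([source[1:], source[:1]], dim=0)
# cross-attn: view 1 attends to view 2 and vice versa, in one batched call
source, _ = cross_attn(source, target, target)
```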

  • The view direction (viewdir) is not positionally embedded (same as PixelNeRF).

  • The ray transformer (multi-head attention) in the decoder mixes 16-dim feature vectors (unexpectedly tiny).

  • The nearest views are taken as source views.

  • Coordinates of 3D points are projected onto the image plane of source view 0 for positional embedding (sketch below). Code
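
A small sketch of that projection, assuming a pinhole camera with intrinsics K and world-to-camera matrix w2c for source view 0 (these variable names are assumptions, not the repo’s):

```python
import torch

def project_to_view0(points, K, w2c):
    """points: (P, 3) world coords; K: (3, 3); w2c: (4, 4) -> (P, 2) pixel coords."""
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # homogeneous (P, 4)
    cam = (w2c @ pts_h.T).T[:, :3]   # world -> camera frame of view 0
    pix = (K @ cam.T).T              # camera -> image plane
    return pix[:, :2] / pix[:, 2:3]  # perspective divide -> (u, v) used for positional embedding
```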

  • The encoder is supposed to provide the overall (geometry) prior, so they emphasize in the paper:

    do not tune encoder for per-scene fine tuning
