Explicit Correspondence Matching for Generalizable Neural Radiance Fields
Notes
Idea
Take the differences between a 3D point's reference-view image features as a geometry prior.
- (2023-10-25) The method is like a simplified GPNR: feature differences between views are measured by cosine similarity (dot product). However, GPNR is more thorough, while MatchNeRF only computes differences for each pair of views and averages them into the input feature.
- (2023-12-09) Image features alone cannot constrain the geometry of unseen scenes well. In contrast, MVSNet presets depths explicitly. Moreover, MVSNet uses the variance of features across all feature maps to measure depth misalignment, whereas MatchNeRF uses the differences between each pair of feature maps (see the sketch below).
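A toy comparison of the two aggregation styles (my own sketch, not code from either repo), for the features of one 3D point sampled from the N reference views:

```python
import torch
import torch.nn.functional as F

def mvsnet_style_cost(feats):
    # feats: (N_views, C) features of one 3D point, one row per reference view.
    # MVSNet aggregates all views at once via per-channel variance.
    return feats.var(dim=0, unbiased=False)            # (C,)

def matchnerf_style_cue(feats):
    # MatchNeRF instead scores each pair of views by cosine similarity
    # (group-wise in the real model) and averages over the pairs.
    f = F.normalize(feats, dim=-1)                      # (N_views, C)
    sims = f @ f.t()                                    # (N_views, N_views)
    iu = torch.triu_indices(f.shape[0], f.shape[0], offset=1)
    return sims[iu[0], iu[1]].mean()                    # scalar here (one value per channel group in the real model)
```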
Pipeline
- A CNN extracts an image feature map (1/8 resolution), which is upsampled to 1/4 resolution, for each reference view.
- Select a pair of reference views and mix their feature maps via cross-attention (GMFlow).
- Project a 3D point onto this pair of interacted feature maps.
- Measure the difference between the two feature vectors by cosine similarity (dot product).
- The feature difference indicates whether the 3D point lies on a surface, so it provides a geometry prior.
- The dot product of two feature vectors is a scalar, which loses much information. So they divide the channels into groups and perform a "group-wise dot product", concatenating the per-group dot products into a vector 𝐳. Likewise, the 1/4-resolution feature map yields another dot-product vector $\hat{\mathbf{z}}$ for each pair of reference views.
- Given 𝑁 reference views, there are 𝑁(𝑁-1)/2 pairs of reference views, corresponding to 𝑁(𝑁-1)/2 "difference" vectors, which are merged into a single 𝐳 by element-wise averaging (see the sketch after this list).
- This "feature difference" vector 𝐳 (the geometry prior) is fed, along with the 3D point's position and view direction, into the decoder (an MLP and a ray transformer), which regresses the color and volume density.
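A minimal sketch of the group-wise cosine similarity and pair averaging described above (my own reconstruction, not the official code; the function names and `num_groups=8` are assumptions):

```python
import itertools
import torch
import torch.nn.functional as F

def groupwise_cosine(f1, f2, num_groups=8):
    """f1, f2: (num_points, C) features of the same 3D points, sampled from
    two cross-attended reference views. Returns (num_points, num_groups)."""
    P, C = f1.shape
    g1 = F.normalize(f1.view(P, num_groups, C // num_groups), dim=-1)
    g2 = F.normalize(f2.view(P, num_groups, C // num_groups), dim=-1)
    return (g1 * g2).sum(dim=-1)                   # per-group cosine similarity

def matching_prior(per_view_feats, num_groups=8):
    """per_view_feats: list of N tensors of shape (num_points, C), one per view.
    Averages the group-wise similarities over all N(N-1)/2 pairs into z."""
    pair_z = [groupwise_cosine(fa, fb, num_groups)
              for fa, fb in itertools.combinations(per_view_feats, 2)]
    return torch.stack(pair_z, dim=0).mean(dim=0)  # (num_points, num_groups)
```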
Experiments
Settings follow MVSNeRF.
Datasets:
| Stage | Data | Contents | Resolution | N_views |
|---|---|---|---|---|
| Train | DTU | 88 scenes | 512x640 | 49 |
| Test | DTU | 16 scenes | | 3 |
| Test | NeRF real | 8 scenes | 640x960 | 4 |
| Test | Blender | 8 scenes | 800x800 | 4 |
- Device: V100 (16 GB)
Play
(2023-08-24)
Compare with GNT
The architectures of Match-NeRF and GNT are similar.
- (2023-12-09) Overview: source images' features are extracted, mixed, and regressed to rgbσ.
- MatchNeRF is trained only on the DTU dataset, while GNT can be trained on multiple datasets (`gnt_full`).
- GNT merges multiple source views via subtraction-based attention, while MatchNeRF fuses the multi-view feature maps before they enter the model.
- MatchNeRF mixes the entire feature maps for each pair of reference views and then projects 3D points onto the fused feature maps to index feature vectors, whereas GNT directly mixes a point's feature vectors sampled from each feature map.
- Different training settings:

| Hyper-params | GNT | MatchNeRF |
|---|---|---|
| #rays for grad-descent | 2048 | 1024 |
| #source views | 8~10 | 3 |

- A 1080Ti only supports `--nerf.rand_rays_train=512` for MatchNeRF. `opts.batch_size` is divided evenly across GPUs (`self.opts.batch_size // len(self.opts.gpu_ids)`), so bs (number of images) = 1 cannot be split over multiple GPUs. And with bs = 2, each card still has to process 1024 rays selected from one image.
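  A toy illustration of the per-GPU split (variable names are hypothetical; only the `batch_size // len(gpu_ids)` division mirrors the repo):

  ```python
  # Each GPU receives opts.batch_size // len(opts.gpu_ids) images.
  batch_size, gpu_ids = 1, [0, 1]
  images_per_gpu = batch_size // len(gpu_ids)
  print(images_per_gpu)  # 0 -> a single image cannot be split across two GPUs
  # With batch_size=2, each GPU gets one image and still samples 1024 rays from it.
  ```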
- Testing with one 1080Ti:
  `python test.py --yaml=test --name=matchnerf_3v --nerf.rand_rays_test=10240`
(2023-09-27)
Code Details
- GMFlow uses 6 transformer blocks consisting of `self_attn` and `cross_attn` for fusing windows, where the odd-indexed blocks perform a window shift.
- MatchNeRF fully fine-tunes the pre-trained GMFlow.
- Does the inner product of a pair of features come from GMFlow?
  - (2023-10-25) I guess it is a simplified attention, applied only to pairs instead of among all views.
  - (2023-12-09) Inferring geometry from the differences between high-dimensional features may predate even MVSNet.
- Self-attn and cross-attn for two samples `data1` and `data2` can be done in a single GMFlow transformer block by concatenating the two samples along the batch dimension twice in different orders, i.e., `source = [data1, data2]` and `target = [data2, data1]`. Self-attn is then performed between `source` and `source`, and cross-attn between `source` and `target`. When the fused `source` is returned after a block, its order must be reversed again to form the new `target` (see the sketch below).
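  A minimal sketch of this batch-concatenation trick (my own paraphrase; `attention` is a hypothetical stand-in for GMFlow's windowed attention):

  ```python
  import torch

  def attention(query, key_value):
      # Hypothetical stand-in for GMFlow's attention; the real code also
      # projects q/k/v and works on shifted local windows.
      scores = query @ key_value.transpose(-2, -1) / query.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ key_value

  def self_and_cross(data1, data2):
      # Concatenate the two samples along the batch dim in both orders, so one
      # batched attention call covers both directions (data1->data2, data2->data1).
      source = torch.cat([data1, data2], dim=0)   # [data1, data2]
      target = torch.cat([data2, data1], dim=0)   # [data2, data1]
      source = attention(source, source)          # self-attn: source with itself
      source = attention(source, target)          # cross-attn: source with target
      d1, d2 = source.chunk(2, dim=0)
      # For the next block, the new target is torch.cat([d2, d1], dim=0).
      return d1, d2
  ```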
- viewdir is not positionally embedded (same as PixelNeRF).
- The ray transformer (MHA) in the decoder mixes 16-dim feature vectors (unexpectedly tiny); see the sketch below.
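  A toy sketch of such a ray transformer (the shapes and `num_heads` are my assumptions, not the repo's module):

  ```python
  import torch

  # Multi-head attention mixing the 16-dim per-sample features along each ray.
  ray_attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

  feats = torch.randn(1024, 128, 16)        # (num_rays, samples_per_ray, feat_dim)
  mixed, _ = ray_attn(feats, feats, feats)  # samples on the same ray attend to each other
  ```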
- The nearest views are selected as source views.
- Coordinates of 3D points are projected onto the image plane of source view 0 for positional embedding (Code); see the sketch below.
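  A minimal sketch of that projection (generic pinhole-camera math with assumed `K`/`w2c` conventions, not the linked code):

  ```python
  import torch

  def project_to_view0(points_world, K, w2c):
      """points_world: (P, 3) 3D points; K: (3, 3) intrinsics; w2c: (4, 4)
      world-to-camera matrix of source view 0. Returns (P, 2) pixel coords."""
      num_pts = points_world.shape[0]
      homo = torch.cat([points_world, torch.ones(num_pts, 1)], dim=-1)  # (P, 4)
      cam = (w2c @ homo.t()).t()[:, :3]                                 # camera-frame coords
      pix = (K @ cam.t()).t()                                           # (P, 3)
      return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                   # perspective divide
  # The resulting 2D coordinates are what the positional embedding is applied to.
  ```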
- The encoder is supposed to provide the overall (geometry) prior, so they emphasized in the paper: do not tune the encoder during per-scene fine-tuning.