Explicit Correspondence Matching for Generalizable Neural Radiance Fields
Notes
Idea
Take the differences between a 3D point's reference-view image features as a geometry prior.
- (2023-10-25) The method is like a simplified GPNR: feature differences between views are measured by cosine similarity (dot product). However, GPNR is more thorough, while MatchNeRF only computes differences for each pair of views and averages them into the input feature.
- (2023-12-09) Image features alone cannot constrain the geometry of unseen scenes well. In contrast, MVSNet presets depths explicitly. Moreover, MVSNet uses the variance of features across all feature maps to measure depth misalignment, whereas MatchNeRF uses the differences between each pair of feature maps (see the sketch below).
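A toy comparison of the two aggregation styles (my own sketch, not code from either repo), for the features of one 3D point sampled from the N reference views:

```python
import torch
import torch.nn.functional as F

def mvsnet_style_cost(feats):
    # feats: (N_views, C) features of one 3D point, one row per reference view.
    # MVSNet aggregates all views at once via per-channel variance.
    return feats.var(dim=0, unbiased=False)            # (C,)

def matchnerf_style_cue(feats):
    # MatchNeRF instead scores each pair of views by cosine similarity
    # (group-wise in the real model) and averages over the pairs.
    f = F.normalize(feats, dim=-1)                      # (N_views, C)
    sims = f @ f.t()                                    # (N_views, N_views)
    iu = torch.triu_indices(f.shape[0], f.shape[0], offset=1)
    return sims[iu[0], iu[1]].mean()                    # scalar here (one value per channel group in the real model)
```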
Pipeline
- A CNN extracts an image feature map (1/8 resolution), which is upsampled to 1/4 resolution, for each reference view.
- Select a pair of reference views and mix their feature maps via cross-attention (GMFlow).
- Project a 3D point onto this pair of interacted feature maps.
- Measure the difference between the two feature vectors by cosine similarity (dot product).
- The feature difference indicates whether the 3D point lies on a surface, so it provides a geometry prior.
- The dot product of two feature vectors is a scalar, which loses much information. So they divide the channels into groups and perform a "group-wise dot product", concatenating the per-group dot products into a vector 𝐳. Likewise, the 1/4-resolution feature map yields another dot-product vector $\hat{\mathbf{z}}$ for each pair of reference views.
- Given 𝑁 reference views, there are 𝑁(𝑁-1)/2 pairs of reference views, corresponding to 𝑁(𝑁-1)/2 "difference" vectors, which are merged into a single 𝐳 by element-wise averaging (see the sketch after this list).
- This "feature difference" vector 𝐳 (the geometry prior) is fed, along with the 3D point's position and view direction, into the decoder (an MLP and a ray transformer), which regresses the color and volume density.
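A minimal sketch of the group-wise cosine similarity and pair averaging described above (my own reconstruction, not the official code; the function names and `num_groups=8` are assumptions):

```python
import itertools
import torch
import torch.nn.functional as F

def groupwise_cosine(f1, f2, num_groups=8):
    """f1, f2: (num_points, C) features of the same 3D points, sampled from
    two cross-attended reference views. Returns (num_points, num_groups)."""
    P, C = f1.shape
    g1 = F.normalize(f1.view(P, num_groups, C // num_groups), dim=-1)
    g2 = F.normalize(f2.view(P, num_groups, C // num_groups), dim=-1)
    return (g1 * g2).sum(dim=-1)                   # per-group cosine similarity

def matching_prior(per_view_feats, num_groups=8):
    """per_view_feats: list of N tensors of shape (num_points, C), one per view.
    Averages the group-wise similarities over all N(N-1)/2 pairs into z."""
    pair_z = [groupwise_cosine(fa, fb, num_groups)
              for fa, fb in itertools.combinations(per_view_feats, 2)]
    return torch.stack(pair_z, dim=0).mean(dim=0)  # (num_points, num_groups)
```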
Experiments
Settings follow MVSNeRF.
Datasets:
| Stage | Data | Contents | Resolution | N_views |
|---|---|---|---|---|
| Train | DTU | 88 scenes | 512x640 | 49 |
| Test | DTU | 16 scenes | | 3 |
| Test | NeRF real | 8 scenes | 640x960 | 4 |
| Test | Blender | 8 scenes | 800x800 | 4 |
- Device: V100 (16 GB)
Play
(2023-08-24)
Compare with GNT
The architectures of Match-NeRF and GNT are similar.
- (2023-12-09) Overview: source images' features are extracted, mixed, and regressed to rgbσ.
- MatchNeRF is trained only on the DTU dataset, while GNT can be trained on multiple datasets (`gnt_full`).
- GNT merges multiple source views via subtraction-based attention, while MatchNeRF fuses the multi-view feature maps before they enter the model.
- MatchNeRF mixes the entire feature maps for each pair of reference views and then projects 3D points onto the fused feature maps to index feature vectors, whereas GNT directly mixes a point's feature vectors sampled from each feature map.
- Different training settings:

| Hyper-params | GNT | MatchNeRF |
|---|---|---|
| #rays for grad-descent | 2048 | 1024 |
| #source views | 8~10 | 3 |

- A 1080Ti only supports `--nerf.rand_rays_train=512` for MatchNeRF. `opts.batch_size` is divided evenly across GPUs (`self.opts.batch_size // len(self.opts.gpu_ids)`), so bs (number of images) = 1 cannot be split over multiple GPUs. And with bs = 2, each card still has to process 1024 rays selected from one image.
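  A toy illustration of the per-GPU split (variable names are hypothetical; only the `batch_size // len(gpu_ids)` division mirrors the repo):

  ```python
  # Each GPU receives opts.batch_size // len(opts.gpu_ids) images.
  batch_size, gpu_ids = 1, [0, 1]
  images_per_gpu = batch_size // len(gpu_ids)
  print(images_per_gpu)  # 0 -> a single image cannot be split across two GPUs
  # With batch_size=2, each GPU gets one image and still samples 1024 rays from it.
  ```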
- Testing with one 1080Ti:
  `python test.py --yaml=test --name=matchnerf_3v --nerf.rand_rays_test=10240`
(2023-09-27)
Code Details
- GMFlow uses 6 transformer blocks consisting of `self_attn` and `cross_attn` for fusing windows, where the odd-indexed blocks perform a window shift.
- MatchNeRF fully fine-tunes the pre-trained GMFlow.
- Does the inner product of a pair of features come from GMFlow?
  - (2023-10-25) I guess it is a simplified attention, applied only to pairs instead of among all views.
  - (2023-12-09) Inferring geometry from the differences between high-dimensional features may predate even MVSNet.
- Self-attn and cross-attn for two samples `data1` and `data2` can be done in a single GMFlow transformer block by concatenating the two samples along the batch dimension twice in different orders, i.e., `source = [data1, data2]` and `target = [data2, data1]`. Self-attn is then performed between `source` and `source`, and cross-attn between `source` and `target`. When the fused `source` is returned after a block, its order must be reversed again to form the new `target` (see the sketch below).
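  A minimal sketch of this batch-concatenation trick (my own paraphrase; `attention` is a hypothetical stand-in for GMFlow's windowed attention):

  ```python
  import torch

  def attention(query, key_value):
      # Hypothetical stand-in for GMFlow's attention; the real code also
      # projects q/k/v and works on shifted local windows.
      scores = query @ key_value.transpose(-2, -1) / query.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ key_value

  def self_and_cross(data1, data2):
      # Concatenate the two samples along the batch dim in both orders, so one
      # batched attention call covers both directions (data1->data2, data2->data1).
      source = torch.cat([data1, data2], dim=0)   # [data1, data2]
      target = torch.cat([data2, data1], dim=0)   # [data2, data1]
      source = attention(source, source)          # self-attn: source with itself
      source = attention(source, target)          # cross-attn: source with target
      d1, d2 = source.chunk(2, dim=0)
      # For the next block, the new target is torch.cat([d2, d1], dim=0).
      return d1, d2
  ```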
- viewdir is not positionally embedded (same as PixelNeRF).
- The ray transformer (MHA) in the decoder mixes 16-dim feature vectors (unexpectedly tiny); see the sketch below.
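  A toy sketch of such a ray transformer (the shapes and `num_heads` are my assumptions, not the repo's module):

  ```python
  import torch

  # Multi-head attention mixing the 16-dim per-sample features along each ray.
  ray_attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

  feats = torch.randn(1024, 128, 16)        # (num_rays, samples_per_ray, feat_dim)
  mixed, _ = ray_attn(feats, feats, feats)  # samples on the same ray attend to each other
  ```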
- The nearest views are selected as source views.
- Coordinates of 3D points are projected onto the image plane of source view 0 for positional embedding (Code); see the sketch below.
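  A minimal sketch of that projection (generic pinhole-camera math with assumed `K`/`w2c` conventions, not the linked code):

  ```python
  import torch

  def project_to_view0(points_world, K, w2c):
      """points_world: (P, 3) 3D points; K: (3, 3) intrinsics; w2c: (4, 4)
      world-to-camera matrix of source view 0. Returns (P, 2) pixel coords."""
      num_pts = points_world.shape[0]
      homo = torch.cat([points_world, torch.ones(num_pts, 1)], dim=-1)  # (P, 4)
      cam = (w2c @ homo.t()).t()[:, :3]                                 # camera-frame coords
      pix = (K @ cam.t()).t()                                           # (P, 3)
      return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                   # perspective divide
  # The resulting 2D coordinates are what the positional embedding is applied to.
  ```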
- The encoder is supposed to provide the overall (geometry) prior, so they emphasized in the paper: do not tune the encoder during per-scene fine-tuning.