Table of contents
In one sentence: train NeRF conditioned on image features, so the model generalizes across scenes and is not limited to a single scene, unlike vanilla NeRF, which only does per-scene optimization.
(Figure 1: pixelNeRF pipeline overview)
As in Figure 1, rays are chosen at random in pixel space (in order to randomly pick N points in 3D space); each point is projected onto the feature maps (512 channels) of the different views, giving N × N_views feature vectors, which the MLP transforms into (r, g, b, density); volume rendering then composites the result.
Abstract
- NeRF conditioned on one or few images
- avoids costly independent per-scene optimization
- uses a convolutional encoder to learn a scene prior
- no explicit 3D supervision
- enables the single-image novel view synthesis task
1 Introduction
- Topic: synthesize novel views of a scene from sparse input views
- Problem
- Former solutions & drawbacks:
- Differentiable neural rendering: represent the scene as a neural network that generates rendered images
- NeRF: encode the 3D position and view direction into volume density and color; it needs too many input images
- Task: predicting NeRFs from one or several images.
- Contributions: utilize pixel-aligned spatial image features to learn scene priors. (Specifically) Input image → convolutional feature grid → 𝐱, 𝐝, feature (as a residual) → NeRF (pooling-based multi-view feature fusion) → σ, rgb
- Results statement
- no 3D supervision: neither 3D shapes nor object masks
- not trained in a canonical coordinate system, but in each camera's coordinate system
- “convolution preserves the spatial alignment between the image and the output 3D representation”
- flexible number of input views at test time
(Experiments) ShapeNet; DTU
2 Related Work
- Novel View Synthesis
- Learning-based 3D reconstruction
- Viewer-centric 3D reconstruction
3 Background: NeRF
- (job) (input → output) x, d → density, color
- (training manner) volume rendering (see the quadrature below)
- (loss func) MSE between rendered and ground-truth pixel colors
- Limitation: relies only on geometric consistency, so a model must be optimized individually for every scene
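For reference, the volume-rendering step above is NeRF's standard quadrature: along a ray $\mathbf{r}$ with samples $i = 1,\dots,N$, sample spacings $\delta_i$, predicted densities $\sigma_i$ and colors $\mathbf{c}_i$,

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),$$

and the training loss is $\sum_{\mathbf{r}} \lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \rVert_2^2$ over sampled rays.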
4 Image-conditioned NeRF
- (improvement) Input spatial image features into the network. (Specifically) Two components: a convolutional encoder + the NeRF MLP. Spatial query positions are expressed in camera space.
- (paragraph order) from the simple case to the general case
4.1 Single-Image pixelNeRF
- (main idea) All inputs are expressed in the input view's camera space.
- (steps) Input image → feature volume W = E(I) → assign an image feature to each sample point by projecting its 3D position to a 2D image location → NeRF network (see the sketch after this list)
- (pipeline)
- (the role of query view direction)
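A minimal PyTorch sketch of the projection-and-indexing step above. The helper name, intrinsics handling, and tensor shapes are assumptions for illustration, not the authors' code; the actual implementation also uses `F.grid_sample()` (see the Code section below).

```python
import torch
import torch.nn.functional as F

def index_features(feat, xyz_cam, K):
    # feat:    (B, C, H, W) feature volume W = E(I)
    # xyz_cam: (B, N, 3) query points in the input view's camera space
    # K:       (B, 3, 3) camera intrinsics
    # Perspective projection: uv_h = K @ xyz, then divide by depth.
    uv_h = torch.einsum('bij,bnj->bni', K, xyz_cam)       # (B, N, 3)
    uv = uv_h[..., :2] / uv_h[..., 2:3].clamp(min=1e-8)   # (B, N, 2) pixels
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    H, W = feat.shape[-2:]
    grid = uv / uv.new_tensor([W - 1.0, H - 1.0]) * 2 - 1
    # grid_sample wants a (B, H_out, W_out, 2) grid; use (B, N, 1, 2).
    out = F.grid_sample(feat, grid.unsqueeze(2), mode='bilinear',
                        align_corners=True)               # (B, C, N, 1)
    return out.squeeze(-1).transpose(1, 2)                # (B, N, C)
```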
4.2 Incorporating Multiple Views
- (main idea) extract information from multiple views to resolve geometric ambiguities, and accept a variable number of images at test time; not tied to a canonical world space
- (notation explanation) the "world" space can be anchored to any input view
- Project each world-space query point into every input view's camera space. The initial layers process each view independently, and the final layers aggregate across views.
- (model pipeline, sketched below)
- Encode each input image into a feature volume and retrieve the feature at each sample point's projected location.
- These features, combined with (xyz, viewdirs), are input to the initial layers; the intermediate vectors from all views are fused by average pooling before the final layers, which compress the pooled vector into density and color.
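A minimal sketch of the pooled multi-view MLP described above. The layer sizes and the two-stage split are assumptions; the paper's positional encoding and residual connections are omitted for brevity.

```python
import torch
import torch.nn as nn

class PooledNeRFMLP(nn.Module):
    """Per-view initial layers, average pooling across views,
    then shared final layers producing (sigma, r, g, b)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # f1 runs on each view independently: (xyz, viewdir, feature).
        self.f1 = nn.Sequential(nn.Linear(3 + 3 + feat_dim, hidden),
                                nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        # f2 consumes the view-pooled intermediate vector.
        self.f2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 4))  # (sigma, r, g, b)

    def forward(self, xyz, viewdir, feats):
        # xyz, viewdir: (NV, N, 3) points/dirs in each input view's space
        # feats:        (NV, N, feat_dim) pixel-aligned features per view
        h = self.f1(torch.cat([xyz, viewdir, feats], dim=-1))  # (NV, N, hidden)
        h = h.mean(dim=0)             # average-pool over the NV views
        return self.f2(h)             # (N, 4)
```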
5 Experiments
- Datasets:
- ShapeNet (category-specific and category-agnostic view synthesis);
- ShapeNet scenes with unseen categories and multiple objects;
- DTU, downscaled to 1/4 resolution (400×300)
- Baselines: SRN, DVR
- Metrics: PSNR, SSIM, LPIPS
- Implementation Details:
- Image encoder: ResNet34;
- Features: no pooling; upsampled with bilinear interpolation; the feature maps of multiple encoder layers are concatenated
- Combine each point's position and view direction with the feature in a ResNet manner (as a residual); see the sketch below
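A hypothetical block illustrating the residual conditioning noted above: the image feature enters as an additive residual rather than as a one-time input. The class name and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionedResBlock(nn.Module):
    def __init__(self, hidden=256, feat_dim=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden))
        self.feat_proj = nn.Linear(feat_dim, hidden)  # project image feature

    def forward(self, h, feat):
        # h:    (N, hidden) running activation for N query points
        # feat: (N, feat_dim) pixel-aligned image feature per point
        return h + self.fc(h) + self.feat_proj(feat)  # residual conditioning
```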
5.1 ShapeNet Benchmarks
ShapeNet is used for both category-specific and category-agnostic view synthesis.
Category-specific View Synthesis Benchmark
- one-shot and two-shot view synthesis
- One model is trained per category; the data contain 50 random views per object instance, of which only one or two are encoded and fed into the network.
- Ablations: the benefit of local features
Category-agnostic Object Prior
- Train a single model on the 13 largest categories of ShapeNet
- randomly pick one of the 24 fixed-elevation views of each object instance
5.2 Pushing the Boundaries of ShapeNet
- (tasks) less controlled capture scenarios in ShapeNet:
- unseen object categories,
- multiple-object scenes,
- simulation-to-real transfer on car images.
- Generalization to novel categories: apply the model to unseen categories
- Multiple-object scenes: the geometric features must be usable from any view direction (360°)
5.3 Scene Prior on Real Images
Code
(2023-07-20)
calc_losses():
\begin{algorithm}
\caption{Main steps of calc\_losses()}
\begin{algorithmic}
\PROCEDURE{calc-losses}{}
\STATE Sample 4 objects (folders) from the 88 training scans as a super batch
\STATE For each object, randomly sample 1 or 3 of its 49 views as target images for training
\STATE Sample a batch of rays (N\_rand)
\STATE Extract latent feature maps `self.latent`
\STATE Sample z\_coarse points along the rays
\STATE Transform the points into the NS source cameras' spaces
\STATE Perspective-project xyz to 2D locations uv on the feature maps
\STATE Index latent vectors from the feature maps with `F.grid\_sample()`
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
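A minimal sketch of the "transform the points into the NS source cameras' spaces" step above. The helper name and the pose convention (world-to-camera 4×4 matrices) are assumptions for illustration, not the repo's actual code.

```python
import torch

def world_to_camera(xyz_world, poses_w2c):
    # xyz_world: (N, 3) sampled points on the rays, in world coordinates
    # poses_w2c: (NS, 4, 4) world-to-camera matrices of the NS source views
    # returns:   (NS, N, 3) the same points in every source camera's frame
    R = poses_w2c[:, :3, :3]   # (NS, 3, 3) rotations
    t = poses_w2c[:, :3, 3:]   # (NS, 3, 1) translations
    # x_cam = R @ x_world + t, broadcast over the NS source views.
    return (R @ xyz_world.T + t).transpose(1, 2)  # (NS, N, 3)
```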
(2024-06-06)
- NeuRay (Intro): PixelNeRF is analogous to stereo matching. MVSNet projects multiple hypothesized depth planes onto the source images (feature maps) and then matches the multi-view features with a U-Net; similarly, PixelNeRF projects the points sampled at multiple depths along each ray onto the source images.
- MVSNet does not have an occlusion problem, because everything it deals with is a photo (a plane) without 3D structure from beginning to end. NeRF, in contrast, handles the entire 3D space, where occlusion can occur when the viewpoint changes. NeuRay uses image features to determine whether a sample point on a ray is visible or not.
DTU dataset
(2023-08-17)
get_split_dataset() (in `data/__init__.py`) will construct different Dataset objects for different datasets; a dispatch sketch follows.
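A hedged sketch of the dispatch pattern described above, not the repository's actual code: the registry, the stub class, and the function signature are assumptions for illustration.

```python
from torch.utils.data import Dataset

class SRNDataset(Dataset):
    """Placeholder stub; the real class loads an SRN data split."""
    def __init__(self, datadir, stage="train", **kwargs):
        self.datadir, self.stage = datadir, stage
    def __len__(self):
        return 0
    def __getitem__(self, idx):
        raise IndexError(idx)

# Registry mapping a dataset-type string to its Dataset class.
_DATASETS = {"srn": SRNDataset}

def get_split_dataset(dataset_type, datadir, split="train", **kwargs):
    # Dispatch on the dataset-type string and build the matching Dataset.
    return _DATASETS[dataset_type](datadir, stage=split, **kwargs)
```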