- Surfaced by the NeRF&Beyond Dec 5 daily report (CustomNeRF, VideoRF, mesh-guided editing, SANeRF-HQ, GPS-Gaussian, two GaussianAvatars, SplaTAM, gsplat) - an article by Jason陪你练绝技 on Zhihu
- Feature image from: Understanding the Covariance Matrix - Janakiev
Projection
Old Notes on 2023-12-05
NDC are the de-homogenized clip coordinates and range in [-1,1].
De-homogenized means the coordinates have been divided by the w component (here, the z value).
Mapping a point (x,y,z) in camera space to clip coordinates, i.e., the [-1,1] cube, can be decomposed into two operations, perspective projection and range scaling, which are then composed.
$$ \begin{array}{ccc} \begin{bmatrix}x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = \begin{bmatrix} โก & โก & โก & โก \\ โก & โก & โก & โก \\ โก & โก & โก & โก \\ โก & โก & โก & โก \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ 1 \end{bmatrix} \end{array} $$
The individual perspective projection:
$$ \begin{bmatrix} fโ & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 &1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} $$
After that, the image-plane coordinates are (u, v), where $u = \frac{fₓx}{z}, v=\frac{f_yy}{z}$.
Then scaling the ranges:
- Scale the range of u from [-w/2, w/2] to [-1,1] through a linear mapping: α u + β. α and β can be solved based on two points:
$$ \begin{array}{cc} \begin{cases} ฮฑ (-w/2) + ฮฒ = -1 \\ ฮฑ w/2 + ฮฒ = 1 \end{cases} โ \begin{cases} ฮฑ = 2/w \\ ฮฒ = 0 \end{cases} \end{array} $$
- Similarly, scale the range of v from [-h/2, h/2] to [-1,1] through a linear mapping: α v + β. Thus, $α= 2/h, β=0$.
So far, the first 2 rows are determined:
$$ \begin{array}{ccc} \begin{bmatrix}x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = \begin{bmatrix} 2fโ/w & 0 & 0 & 0 \\ 0 & 2f_y/h & 0 & 0 \\ โก & โก & โก & โก \\ โก & โก & โก & โก \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ 1 \end{bmatrix} \end{array} $$
- When scaling z, it has nothing to do with x and y. Thus, the 3rd row has the form 0 0 □ □.
- Because NDC are the clip coordinates divided by z, the 4th row is 0 0 1 0 (so that $w_{clip} = z$).
$$ \begin{array}{ccc} \begin{bmatrix}x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = \begin{bmatrix} 2fโ/w & 0 & 0 & 0 \\ 0 & 2f_y/h & 0 & 0 \\ 0 & 0 & A & B \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ 1 \end{bmatrix} \end{array} $$
Denoting the two unknowns as A and B, the z dimension in NDC is: $\frac{A z + B}{z}$.
According to the range constraint [-1,1], A and B can be solved from:
$$ \begin{array}{cc} \begin{cases} \frac{A n + B }{n} = -1 \\ \frac{A f + B }{f} = 1 \end{cases} โ \begin{cases} A = (f+n)/(f-n) \\ B = -2fn/(f-n) \end{cases} \end{array} $$
Finally, the mapping from the camera coordinates of a point to corresponding clip coordinates is:
$$ \begin{array}{ccc} \begin{bmatrix}x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = \begin{bmatrix} 2fโ/w & 0 & 0 & 0 \\ 0 & 2f_y/h & 0 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ 1 \end{bmatrix} \end{array} $$
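As a quick sanity check of A and B, a minimal NumPy sketch (the intrinsics and frustum values below are made up for the check) confirms that points on the near and far planes map to z = -1 and z = 1 after perspective division:

```python
import numpy as np

# Hypothetical intrinsics and frustum bounds, just for this check.
fx, fy, w, h, n, f = 500.0, 500.0, 640.0, 480.0, 0.1, 100.0

P = np.array([
    [2*fx/w, 0,      0,           0],
    [0,      2*fy/h, 0,           0],
    [0,      0,      (f+n)/(f-n), -2*f*n/(f-n)],
    [0,      0,      1,           0],
])

for z in (n, f):                      # points on the near and far planes
    clip = P @ np.array([0.0, 0.0, z, 1.0])
    ndc = clip[:3] / clip[3]          # perspective division by w_clip = z
    print(z, ndc[2])                  # n maps to -1, f maps to +1
```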
NDC Mapping
(2024-01-01)
The perspective division (dividing a point's homogeneous coordinates by the w component) should be performed as the final step in the transformation pipeline, as it is a non-linear operation.
Given a 3D point (x,y,z)แต located in the camera space, the perspective projection and scaling are carried out in sequence to obtain its clip coordinates (not NDC yet).
$$ \begin{array}{c} \text{[Scaling Matrix] [Perspective Projection] [Camera space] = [Clip space]} \\ \\ \begin{bmatrix} โก & โก & โก & โก \\ โก & โก & โก & โก \\ โก & โก & โก & โก \\ โก & โก & โก & โก \end{bmatrix} \begin{bmatrix} fโ & 0 & cโ & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x_c \\ y_c \\ z_c \\ w_c \end{bmatrix} \end{array} $$
The scaling matrix is built with the goal of mapping the view frustum to the [-1,1] NDC cube that encompasses only valid points, whose clip coordinates satisfy $-w_c < x_c,y_c,z_c < w_c$. Specifically, the projected coordinates are scaled, and then perspective division is performed to obtain the NDC.
- Perspective projection:
$$ \begin{bmatrix} fโ & 0 & cโ & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} fโx + cโ z \\ f_y y + c_y z \\ z \\ 1 \end{bmatrix} $$
- Scaling the projected coordinates u and v to [-1, 1]:
To scale u, the 1st row of the scaling matrix is A 0 B 0.
Perform scaling first, followed by perspective division, to obtain the x in NDC space: $\frac{A (fโ x + cโ z) + Bz}{z} = A (\frac{fโ x}{z} + cโ) + B โ [-1,1]$
Since $(\frac{fโ x}{z} + cโ) = u โ [0,W]$
$$ \begin{cases} A0 + B = -1 \\ AW + B = 1 \end{cases} โ \begin{cases} A = \frac{2}{W} \\ B = -1 \end{cases} $$
Therefore, the first 2 rows are:
$$ \begin{bmatrix} 2/W & 0 & -1 & 0 \\ 0 & 2/H & -1 & 0 \\ โก & โก & โก & โก \\ โก & โก & โก & โก \end{bmatrix} \begin{bmatrix} fโx + cโ z \\ f_y y + c_y z \\ z \\ 1 \end{bmatrix} $$
- Scaling the frustum depth $z ∈ [n, f]$ to [-1,1]:
z is independent of x and y, so the 3rd row only has 2 unknowns: 0 0 A B.
Scaling first, then perspective division, gives the z in NDC space: $\frac{A z + B}{z} ∈ [-1, 1]$
Substituting z = n and f:
$$ \begin{cases} \frac{A n + B}{n} = -1 \\ \frac{A f + B}{f} = 1 \end{cases} โ \begin{cases} A = \frac{f+n}{f-n} \\ B = \frac{-2fn}{f-n} \end{cases} $$
Finally, since the denominators are z (i.e., the w of a point’s clip coordinates is z), the 4th row is 0 0 1 0:
$$ \begin{bmatrix} 2/W & 0 & -1 & 0 \\ 0 & 2/H & -1 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} fโx + cโ z \\ f_y y + c_y z \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{2 (fโx + cโ z)}{W} - z \\ \frac{2 (f_y y + c_y z) }{H} - z \\ \frac{f+n}{f-n} z - \frac{2fn}{f-n} \\ z \end{bmatrix} $$
The resulting coordinates are in clip space; they become ND coordinates after perspective division, and only points within the [-1,1] cube in NDC space will be rendered.
In summary, the Projection Matrix (GL_PROJECTION) transforming camera-space (x,y,z)แต to clip coordinates is:
$$ P = \begin{bmatrix} \frac{2}{W} & 0 & -1 & 0 \\ 0 & \frac{2}{H} & -1 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} fโ & 0 & cโ & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \frac{2fโ}{W} & 0 & (\frac{2cโ}{W}) -1 & 0 \\ 0 & \frac{2f_y}{H} & (\frac{2c_y}{H}) -1 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} $$
- When cx and cy are W/2 and H/2, it takes the form in Ye’s article.
- When cx and cy are r and t, it takes the form on songho.
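A minimal NumPy sketch of this composition (all intrinsic values below are hypothetical): multiply the scaling matrix by the perspective matrix and check that a camera-space point inside the frustum lands in the [-1,1] cube after perspective division:

```python
import numpy as np

# Hypothetical values; cx, cy need not be W/2, H/2.
fx, fy, cx, cy, W, H, n, f = 520.0, 510.0, 310.0, 250.0, 640.0, 480.0, 0.2, 50.0

scale = np.array([
    [2/W, 0,   -1,          0],
    [0,   2/H, -1,          0],
    [0,   0,   (f+n)/(f-n), -2*f*n/(f-n)],
    [0,   0,   1,           0],
])
persp = np.array([
    [fx, 0,  cx, 0],
    [0,  fy, cy, 0],
    [0,  0,  1,  0],
    [0,  0,  0,  1],
])
P = scale @ persp                      # the combined GL_PROJECTION-style matrix

pt = np.array([0.3, -0.2, 5.0, 1.0])   # a camera-space point inside the frustum
clip = P @ pt
ndc = clip[:3] / clip[3]               # perspective division by w_c = z
print(ndc)                             # all components lie in [-1, 1]
```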
(2024-01-02)
Project mean vector ๐
A mean vector 𝛍 in world space is transformed to pixel space as follows:
- 𝐭 denotes the coordinates of 𝛍 in the camera space: $𝐭 = \begin{bmatrix} 𝐑_{w2c} & 𝐭_{w2c} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \bm μ \\ 1 \end{bmatrix}$
- The translation vector is denoted as $𝐭_{w2c}$, and the rotation part of the extrinsics as $𝐑_{w2c}$.
- The clip coordinates of 𝛍 are $𝐭’ = 𝐏𝐭$
- The nonlinear perspective division is approximated by the affine projective transformation $ρₖ(𝐭’)$.
- Converting a point’s coordinates from world space to clip space:
$$ \begin{aligned} ๐ญ’ &= ๐โ ๐_{w2c}โ [^{\bm \mu}_1] \\ &=\begin{bmatrix} \frac{2fโ}{W} & 0 & (2cโ/W) -1 & 0 \\ 0 & \frac{2f_y}{H} & (2c_y/H) -1 & 0 \\ 0 & 0 & \frac{f+n}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R_{w2c} & t_{w2c} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} ฮผโ \\ ฮผ_y \\ ฮผ_z \\ 1 \end{bmatrix} = \begin{bmatrix} tโ’ \\ t_y’ \\ t_z’ \\ t_w’ \end{bmatrix} \end{aligned} $$
- $t_w’$ is the point’s camera-space depth $t_z$, which is > 0.
- Frustum clipping (clip-space culling) filters out points that won’t appear in the frustum, based on clip coordinates, before perspective division: View Frustum Clipping - UofTexas - Lec9
- View frustum clipping aims to reduce computation. Essentially, it filters points by comparing the clip coordinate $w_c$ with $x_c, y_c, z_c$ for each point.
- In the previous derivation, clip coordinates are the scaled projection coordinates (A·ProjCoord + B). NDC space is defined by $w$: the NDC cube is bounded where the clip coordinates divided by 𝑤 equal ±1 (𝑤 is the benchmark), e.g., $\frac{A (f_x x+c_x z) + Bz}{w} = 1$.
In other words, the NDC planes wrap around points whose clip coordinates $x_c,y_c,z_c$ are less than or equal to $w_c$ in magnitude. In addition, $w_c$ must be the camera-space depth z for the final perspective division. Thus, if a clip coordinate of a point is bigger than $w$ or less than $-w$, the quotient will be outside [-1,1], i.e., the point is located neither in the camera-space view frustum nor in the NDC-space cube.
Frustum clipping retains points that satisfy $-w_c \leq x_c, y_c, z_c \leq w_c$. Conversely, points whose w (equal to the camera-space depth) is smaller than x, y, z will be filtered out.
Although ND coordinates could also identify points for clipping, clipping is performed in clip space for efficiency, to reduce the number of perspective divisions (which are executed last).
- On the other hand, since the ND coordinates of the Left, Right, Bottom, Top, Near, and Far frustum planes are -1 and 1, the clip coordinates of points located within the frustum satisfy: $-1 < \frac{x_c}{w_c}, \frac{y_c}{w_c}, \frac{z_c}{w_c} <1$
- Consequently, only points in the frustum (i.e., the NDC-space cube: $-1 < x’, y’, z’ < 1$) survive.
- (2024-01-03) It is the view frustum clipping that makes it possible to regard ND coordinates as a cube space, because out-of-cube points have been discarded.
- Performing perspective division on the clip coordinates yields the NDC, all of whose components range in [-1,1]:
$$ NDC = \begin{bmatrix} tโ’/t_w’ \\ t_y’/t_w’ \\ t_z’/t_w’ \\ 1 \end{bmatrix} โ [-1,1] $$
- (2024-02-08) NDC is 3D: besides the 2D pixel coordinates, it includes a z’ coordinate.
- Scaling NDC to obtain pixel coordinates 𝛍’ (viewport transformation): songho
$$ [-1, 1] \overset{รW}{โ} [-W, W] \overset{+1}{โ} [-W+1,W+1] \overset{รท2}{โ} [\frac{-W+1}{2}, \frac{W+1}{2}] \\ \overset{+c_x}{โ} [0.5, W+0.5] (\text{ if $c_x =\frac{W}{2}$}) $$
Therefore, the final pixel coordinate ๐’ of a world-space mean vector ๐ is:
$$ \bm \mu’ = \begin{bmatrix} (Wโ tโ’/t_w’ + 1) /2 + c_x \\ (Hโ t_y’/t_w’ + 1) / 2 + c_y \end{bmatrix} $$
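A small sketch of the whole chain, under the conventions derived above (the function name and inputs are mine, not from any library):

```python
import numpy as np

def project_to_pixel(mu_world, R_w2c, t_w2c, P, W, H, cx, cy):
    """World space -> clip space -> NDC -> pixel coordinates (a sketch)."""
    t = R_w2c @ mu_world + t_w2c          # camera-space coordinates t
    clip = P @ np.append(t, 1.0)          # clip coordinates t' = P t
    ndc_xy = clip[:2] / clip[3]           # perspective division by t_w'
    # Viewport transform from the notes: (W*ndc_x + 1)/2 + cx, (H*ndc_y + 1)/2 + cy
    return np.array([(W * ndc_xy[0] + 1) / 2 + cx,
                     (H * ndc_xy[1] + 1) / 2 + cy])
```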
Project Covariance
(2024-01-03)
- Because perspective projection is not a linear operation (due to the division), the 2D projection of a 3D Gaussian is not exactly a 2D Gaussian. (Figures: snowball splat diffusing; is it a 2D Gaussian?; conic sections.)
EWA Splatting doesn’t scale the projected coordinates to the [-1,1] NDC-space cube. It directly transforms points from camera space onto the screen (or viewing-ray space) by dividing by z. The resulting coordinates range in [0,W] and [0,H].
- Projective transformation φ(𝐭) in EWA Splatting: converting an arbitrary point’s camera-space coordinates $𝐭=(tₓ,t_y,t_z)ᵀ$ to coordinates in 3D ray space (pixel coordinates x0,x1 + “new depth” x2) has 2 steps:
- Pixel coords = perspective projection + perspective division.
- The new depth is set to the L2 norm of the point’s camera-space coordinates.
$$ ฯ(๐ญ) = \begin{bmatrix} \frac{fโ tโ}{t_z} + cโ \\ \frac{f_y t_y}{t_z} + c_y \\ \sqrt{t_x^2 + t_y^2 + t_z^2} \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} $$
Because EWA Splatting doesn’t consider frustum clipping, the approximation is based on the camera coordinates ๐ญ.
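A direct transcription of φ(𝐭) as a small Python function (a sketch; the names are mine):

```python
import numpy as np

def ewa_phi(t, fx, fy, cx, cy):
    """EWA projective transformation: camera space -> 3D ray space."""
    x0 = fx * t[0] / t[2] + cx      # pixel x: perspective projection + division
    x1 = fy * t[1] / t[2] + cy      # pixel y
    x2 = np.linalg.norm(t)          # "new depth": L2 norm of camera coordinates
    return np.array([x0, x1, x2])
```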
- (2024-02-16) The clip coordinates shouldn’t be used in EWA splatting, because the Gaussian center is used in the Jacobian that approximates the perspective projection, where the camera-space coordinates are supposed to be used.
In contrast, 3DGS (or gsplat) needs to determine whether a point (Gaussian center) lies in the frustum to be rendered, so the clip coordinates are utilized for frustum clipping. Therefore, the world coordinates 𝐱 are involved in 2 procedures: the conversion of the covariance matrix from world space to ray space (JW𝚺WᵀJᵀ), and the projection of the Gaussian center from world space onto the screen.
Consequently, the derivative of the loss w.r.t. the world coordinates ($\frac{∂L}{∂𝐱}$) has 2 portions.
However, gsplat uses the clip space to filter points outside the camera frustum. So, after perspective projection and scaling for NDC with matrix 𝐏, points are transferred into clip space as 𝐭’ for clipping. After clipping, the nonlinear perspective division and the x₂ reassignment in φ(𝐭) are approximated with an affine transformation based on the clip coordinates 𝐭’. Therefore, the projective transformation $φ(𝐭)$ that maps camera space to ray space becomes a mapping from clip space to ray space, $φ(𝐭’)$.
$$ \begin{aligned} 𝐭 &→ 𝐭’=𝐏𝐭 = \begin{bmatrix} \frac{2 (fₓ t_x + cₓ t_z)}{W} - t_z \\ \frac{2 (f_y t_y + c_y t_z) }{H} - t_z \\ \frac{f+n}{f-n} t_z - \frac{2fn}{f-n} \\ t_z \end{bmatrix} = \begin{bmatrix} tₓ’ \\ t_y’ \\ t_z’ \\ t_z\end{bmatrix} \\ &→ φ(𝐭’) = \begin{bmatrix} tₓ’/t_z’ \\ t_y’/t_z’ \\ ‖𝐭’‖₂ \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \end{aligned} $$
The affine approximation is the first 2 terms of $ฯ(๐ญ’)$’s Taylor expansion evaluated at a Gaussian’s mean vector $๐ญโ’ = (t_{k,x}’, t_{k,y}’, t_{k,z}’)แต$ in the clip space:
$$ \begin{aligned} & φ(𝐭’) ≈ ρₖ(𝐭’) = φ(𝐭ₖ’) + 𝐉_{𝐭ₖ’} ⋅ (𝐭’ - 𝐭ₖ’) \\ & = \begin{bmatrix} t_{k,x}’/t_{k,z}’ \\ t_{k,y}’/t_{k,z}’ \\ ‖𝐭ₖ’‖₂\end{bmatrix} + \begin{bmatrix} 1/t_{k,z}’ & 0 & -t_{k,x}’/{t_{k,z}’}^2 \\ 0 & 1/t_{k,z}’ & -t_{k,y}’/{t_{k,z}’}^2 \\ t_{k,x}’/‖𝐭ₖ’‖₂ & t_{k,y}’/‖𝐭ₖ’‖₂ & t_{k,z}’/‖𝐭ₖ’‖₂ \end{bmatrix} (𝐭’ - 𝐭ₖ’) \end{aligned} $$
If the camera-space coordinates 𝐭 are used to express the projective transformation as φ(𝐭), the focal lengths are exposed explicitly:
$$ \begin{aligned} ฯ(๐ญ) โ ฯโ(๐ญ) &= ฯ(๐ญโ)+ ๐_{๐ญโ} โ (๐ญ - ๐ญโ) \\ &=\begin{bmatrix} 2(fโ t_{k,x}/t_{k,z} + cโ)/W -1 \\ 2(f_y t_{k,y}/t_{k,z} + c_y)/H -1 \\ โ๐ญโโโ \end{bmatrix} + ๐_{๐ญโ} โ (๐ญ - ๐ญโ) \end{aligned} $$
If $c_x=W/2,\ c_y=H/2$, then $ฯโ(๐ญ) = \begin{bmatrix} \frac{2f_x t_{k,x}}{Wโ t_{k,z}} \\ \frac{2f_y t_{k,y}}{Hโ t_{k,z}} \\ โ๐ญโโโ \end{bmatrix}$
and the Jacobian $๐_{๐ญโ}$ will be:
$$ 𝐉_{𝐭ₖ} = \begin{bmatrix} (2/W)⋅fₓ/t_{k,z} & 0 & -(2/W)⋅fₓ t_{k,x} / {t_{k,z}}^2 \\ 0 & (2/H)⋅f_y/t_{k,z} & -(2/H)⋅f_y t_{k,y} / {t_{k,z}}^2 \\ t_{k,x}/‖𝐭ₖ‖₂ & t_{k,y}/‖𝐭ₖ‖₂ & t_{k,z}/‖𝐭ₖ‖₂ \end{bmatrix} $$
- If the camera-film coords x∈[0,W] and y∈[0,H] are not scaled to [-1,1], the scaling factors 2/W and 2/H won’t exist.
Thus, the Jacobian evaluated at center ๐ญโ in the camera space is:
$$ ๐_{๐ญโ} = \begin{bmatrix} fโ/t_{k,z} & 0 & -fโ t_{k,x} / {t_{k,z}}^2 \\ 0 & f_y/t_{k,z} & -f_y t_{k,y} / {t_{k,z}}^2 \\ t_{k,x}/โ๐ญโโโ & t_{k,y}/โ๐ญโโโ & t_{k,z}/โ๐ญโโโ \end{bmatrix} $$
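A minimal NumPy check of this Jacobian against central finite differences of φ(𝐭) (cx and cy are set to 0 since constant offsets don't affect the derivative; all values are made up):

```python
import numpy as np

fx = fy = 500.0

def phi(t):
    # camera space -> ray space, with cx = cy = 0 (offsets drop out of J)
    return np.array([fx * t[0] / t[2], fy * t[1] / t[2], np.linalg.norm(t)])

def jacobian(t):
    tx, ty, tz = t
    L = np.linalg.norm(t)
    return np.array([
        [fx/tz, 0.0,   -fx*tx/tz**2],
        [0.0,   fy/tz, -fy*ty/tz**2],
        [tx/L,  ty/L,   tz/L],       # row 3: gradient of the L2 norm
    ])

t_k = np.array([0.4, -0.3, 3.0])
J, eps = jacobian(t_k), 1e-6
for i in range(3):
    d = np.zeros(3); d[i] = eps
    num = (phi(t_k + d) - phi(t_k - d)) / (2 * eps)   # central differences
    assert np.allclose(num, J[:, i], atol=1e-4)
```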
- By expressing the projective transformation ρₖ() with the camera coordinates 𝐭, the relation between ray-space coordinates and camera-space coordinates is constructed. Thereby, the derivative of the Gaussian center ρₖ(𝐭ₖ) in the ray space is derived from the camera-space coordinates 𝐭ₖ, with the clip coordinates 𝐭ₖ’ skipped.
For the 2D projection, only the x and y dimensions of the covariance matrix need consideration, with the 3rd row and column omitted. Because of the affine approximation, the projective transformation φ(𝐭’) becomes a linear operation; thereby, a 3D Gaussian after perspective division is a 2D Gaussian.
- Figuratively, points surrounding the 3D Gaussian center in clip space will fall into an ellipse on the 2D screen.
- In the 3DGS code, the 2D ellipse is further simplified as a circle to count the overlapped tiles.
-
The covariance matrix ๐บโ’ of the 2D Gaussian in the pixel space corresponding to the 3D Gaussian (๐บโ) in the world space can be derived based on properties of Gaussians as:
$$\bm ฮฃโ’ = ๐โ ๐_{w2c} \bm ฮฃโ ๐_{w2c}แต ๐โแต$$
- Because the covariance matrix 𝚺’ is symmetric, it can be decomposed into a stretching vector (diagonal matrix) and a rotation matrix by SVD, analogous to describing the configuration of an ellipsoid ^3DGS. (Essentially, the covariance matrix depicts a data distribution.)
The rotation matrix is converted to a quaternion during optimization.
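A small NumPy sketch of the projection $\bm Σₖ’ = 𝐉ₖ⋅ 𝐑_{w2c}⋅ \bm Σₖ⋅ 𝐑_{w2c}ᵀ⋅ 𝐉ₖᵀ$ with toy, made-up inputs, keeping only the upper-left 2×2 block for the screen-space Gaussian:

```python
import numpy as np

# Toy inputs (hypothetical): world covariance, extrinsics, camera-space mean.
Sigma_w = np.diag([0.04, 0.01, 0.09])        # 3D covariance in world space
R_w2c = np.eye(3)                            # rotation part of the extrinsics
t_k = np.array([0.5, 0.2, 4.0])              # Gaussian center in camera space
fx = fy = 500.0

tx, ty, tz = t_k
L = np.linalg.norm(t_k)
J = np.array([                               # Jacobian evaluated at t_k
    [fx/tz, 0.0,   -fx*tx/tz**2],
    [0.0,   fy/tz, -fy*ty/tz**2],
    [tx/L,  ty/L,   tz/L],
])

Sigma_ray = J @ R_w2c @ Sigma_w @ R_w2c.T @ J.T   # 3x3 covariance in ray space
Sigma_2d = Sigma_ray[:2, :2]                       # keep the upper-left 2x2 block
```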
In summary: A 3D Gaussian (๐โ,๐บโ) in world (object) space is transformed into 3D ray space (or the screen with the 3rd dim omitted), resulting in:
- Mean vector: $\bm μₖ’ = φ(𝐭ₖ’)$, where $𝐭ₖ’ = 𝐏⋅ [^{𝐑_{w2c} \ 𝐭_{w2c}}_{0 \quad\ 1}] ⋅ [^{\bm μₖ}_1]$ are the clip coordinates.
- Covariance matrix: $\bm Σₖ’ = 𝐉ₖ⋅ 𝐑_{w2c}⋅ \bm Σₖ⋅ 𝐑_{w2c}ᵀ⋅ 𝐉ₖᵀ$
- A point 𝐭’ in clip space within the Gaussian distribution is projected to ray space: $ρₖ(𝐭’) = φ(𝐭ₖ’) + 𝐉ₖ⋅ (𝐭’ - 𝐭ₖ’)$. The discrepancy between the approximated and the real projected locations is $ρₖ(𝐭’)-φ(𝐭’)$.
Rasterizing
Sorting Kernels
(2024-01-09)
Sort Gaussians within each 16x16 tile based on depth
- Every pixel has a perpendicular dotted line, and the discs intersecting that line are visible to the pixel.
- The depths of those discs are $x₂ = ‖𝐭’‖₂ = \sqrt{ {tₓ’}² + {t_y’}² + {t_z’}²}$, the L2 norm of the clip coordinates.
- The disc closer to the screen is more prominent.
- Different images are obtained for different viewing rays, as the opacities of the discs change in tandem with the viewing rays. (Specifically, the opacity of a disc is an integral of the Gaussian in the 3D ray space along a viewing ray.)
In contrast, volume rendering changes point-wise colors for different viewing rays.
- (2024-01-22) In splatting, image formation still relies on alpha compositing. Distinct from NeRF, where a pixel-ray originates from the camera optical center, in splatting a pixel emits a perpendicular ray from the screen, and it is the incoming viewing rays that determine the varying alpha (opacity) of the filters on the pixel-ray path, such that the screen displays diverse colors under various viewing rays.
(2024-04-20)
- I think the “screen” is indeed the camera film. The “viewing ray” doesn’t hit the screen. The viewing ray is just required to pass through the 3D Gaussians to calculate each Gaussian’s opacity by integrating the 3D Gaussian over the intersecting segment.
Rendering a pixel in the splatting method also emits a ray from the pixel and composites discs on the ray. The difference is that the opacity has been precomputed by the splatting procedure: integrating along the viewing ray.
Thus, it is the screen, i.e., the pixels, that shoot lines.
In the implementation of 3DGS, the splatting process is omitted, since the opacity of each Gaussian is acquired by optimizing it iteratively.
Splatting is just one of the ways to get the opacity. As long as the opacities are obtained, any rendering method can be applied to form an image, e.g., rasterization, ray marching/tracing.
Alpha Blending
(2024-01-05)
EWA splatting equation for N kernels existing in the space:
$$ \underset{\substack{↑\\ \text{Pixel}\\ \text{color}}}{C} = ∑_{k∈N} \underset{\substack{↑\\ \text{Kernel}\\ \text{weight}}}{wₖ}⋅ \underset{\substack{↑\\ \text{Kernel} \\ \text{color}}}{cₖ}⋅ \underset{\substack{↑\\ \text{Accumulated} \\ \text{transmittance}}}{oₖ}⋅ (\underset{\substack{↑\\ \text{Kernel}\\ \text{opacity}}}{qₖ}⊗ \underset{\substack{↑\\ \text{Low-pass}\\ \text{filter}}}{h}) (\underset{\substack{↑\\ \text{2D} \\ \text{coords}}}{𝐱}) $$
In 3DGS, each 3D Gaussian in the object space has 4 learnable parameters:
- Color (cₖ): SH for the “directional appearance component of the radiance field”
- Opacity ($qₖ⊗ h$): it results in the accumulated transmittance oₖ as $∏_{m≤k}(1-\text{opacity}ₘ)$.
- Position 𝛍: determines the kernel’s weight wₖ, as wₖ is an evaluation of the projected kernel (2D Gaussian distribution) at the pixel.
- Covariance matrix: the stretching matrix and rotation matrix (quaternion) jointly determine wₖ as well.
The splatting equation can be reformulated as alpha compositing used in 3DGS:
$$C = โ_{nโคN} Tโโ ฮฑโโ cโ, \text{ where } Tโ = โ_{mโคn}(1-ฮฑโ)$$
- Alpha can be expressed with sigma, which is an exponent of e, akin to NeRF:
$$αₙ = oₙ ⋅ exp(-σₙ), \text{ where } σₙ = ½\bm Δₙᵀ \bm Σ’⁻¹ \bm Δₙ$$
- oₙ is a kernel’s opacity, i.e., the above qₖ⊗ h. The negative exponential term is a scalar scaling factor, and σₙ is half the squared Mahalanobis distance.
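A minimal sketch of this compositing for one pixel, assuming the Gaussians are already depth-sorted and their 2D means and inverse covariances are given (the function and variable names are mine):

```python
import numpy as np

def composite_pixel(colors, opacities, means2d, inv_covs2d, pixel):
    """Front-to-back alpha compositing of depth-sorted 2D Gaussians (a sketch)."""
    C = np.zeros(3)
    T = 1.0                                    # accumulated transmittance
    for c, o, mu, inv_cov in zip(colors, opacities, means2d, inv_covs2d):
        delta = pixel - mu                     # offset from the 2D mean
        sigma = 0.5 * delta @ inv_cov @ delta  # half squared Mahalanobis distance
        alpha = o * np.exp(-sigma)             # alpha_n = o_n * exp(-sigma_n)
        C += T * alpha * c                     # C = sum T_n * alpha_n * c_n
        T *= (1.0 - alpha)
    return C
```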
(2024-02-16)
- A Gaussian’s opacity oₙ is fixed after splatting with a specific viewing ray, so during alpha compositing, the variation in alpha among different Gaussians results from the different positions of the target rendering pixel relative to the various Gaussians.
When performing alpha compositing, the alpha of a Gaussian is the Gaussian’s opacity scaled by the “probability” of the rendering pixel’s position under the Gaussian distribution.
- However, the alpha value in NeRF is $αᵢ = 1- exp(-σᵢδᵢ)$ and serves as the opacity in $∑ᵢ₌₁ᴺ Tᵢ αᵢ cᵢ$. Alpha is a converted point opacity ranging in [0,1].
In other words, alpha $ฮฑโ$ equals a kernel’s opacity oโ scaled by a weight Gโ (the above wโ), i.e., the evaluation of 2D Gaussian Gโ at the viewing pixel:
$$ฮฑโ = oโ โ Gโ, \text{where } Gโ = e^{-\frac{\bm ฮโแต \bm ฮโ}{2\bm ฮฃ’}}$$
- oₙ is the opacity (the footprint qₙ) of the n-th 3D Gaussian. qₙ is an integral of the Gaussian in the 3D ray space along the viewing ray: $qₙ(𝐱) = ∫rₙ’(𝐱,x₂)dx₂$.
- oₙ will get optimized directly via gradient descent during training.
- Gₙ is a 2D Gaussian with the normalization factor omitted. Its mean and covariance matrix will get optimized.
- Δₙ is the displacement of a pixel center from the 2D Gaussian’s mean.
Initially, the opacity of an arbitrary 3D location in the object space is considered as an expectation of the contributions from all 3D Gaussians at that location.
After ❶ substituting the perspectively projected (“squashed”) kernel within the 3D ray space into the rendering equation, ❷ switching the sequence of integral and expectation, ❸ and applying simplifying assumptions, the opacity becomes a 1-D (depth) integral along the viewing ray for each kernel in the 3D ray space.
In summary, the changes of opacity before and after perspective projection:
| Aspect | Original form | Post-projection |
|---|---|---|
| Venue | Object space | 3D ray space, or screen |
| Intuition | Opacity combination | Discs stack |
| Scope | Location-wise | Ellipsoid-wise |
| Operation | Expectation | Integral |
| Formula | $f_c(𝐮) = ∑_{k≤N} wₖ rₖ(𝐮)$ | Footprint $qₖ(𝐱) = ∫_{x₂=0}^L rₖ’(𝐱,x₂) dx₂$ |
| Basis | Gauss. mixture | Scene locality |

- Locality: 3D positions are grouped into different ellipsoids.
- (2024-01-06) A 3D Gaussian (datapoint) in the camera space (or clip space) is “perspectively” projected (thrown) onto the screen (or the ray space, as its x₂ is independently assigned besides the screen coordinates x₀,x₁), resulting in a 2D Gaussian (applying the Taylor expansion to approximate the nonlinear perspective effects).
The viewing ray in camera space is projected into the 3D ray space and remains a straight line segment (due to the linear approximation); the 3D line is then projected onto the screen orthogonally.
The projection is orthogonal because the 3D Gaussians have already been projected onto the screen (the location has been determined as x,y divided by z, and the covariance matrix as 𝚺’= 𝐉𝐖 𝚺𝐖ᵀ 𝐉ᵀ), yielding 2D Gaussians. So each pixel is only derived from those projected kernels (2D Gaussians) that overlap with it (“overlapping” refers to 3DGS), like a stack of filters in alpha compositing. That implies the alpha compositing is performed in the screen space, or the ray space, as the ray space is equivalent to the screen (EWA paper: “the transformation from volume-data to ray space is equivalent to perspective projection.”).
With the “orthogonal correspondence” between the ray space and the screen, the ray integral (footprint function, or kernel’s opacity) in the 3D ray space becomes (?? not sure) an integral on the 2D screen plane, i.e., an integral of a 2D Gaussian.
And the ray in the 3D ray space corresponds to a line on the screen, as rays in the 3D ray space are parallel (i.e., orthogonal projection). Thus, the opacity is an integral along that line.
The alpha of an arbitrary point on the screen (or in the 3D ray space) is a 2D-Gaussian mixture over all kernels.
(Not on the screen: the object space is an opacity combination, whereas the screen space is filter stacking.)
The alpha of a 3D location is calculated in each 3D Gaussian based on the distance to the center.
And the final alpha at the location is the expectation of all the evaluations. (No location-wise alpha was calculated.)
The color of a pixel on the screen is a 2D-Gaussian mixture:
The alpha compositing process for a pixel is illustrated below:
- Opacities (o₁,o₂,o₃) of different kernels are integrals of various lengths along the viewing ray in the 3D ray space.
  - Not sure whether the integral in 3D ray space equals the integral on the screen.
- Weight (w₁,w₂,w₃) of a kernel’s opacity is its evaluation at the pixel.
- Alpha (α₁,α₂,α₃) of a kernel equals its opacity multiplied by its weight.
- Accumulated transmittance (T₁,T₂,T₃) equals the product of the previously passed kernels’ transmittances.
- Pixel color is the sum of each kernel’s color scaled by its alpha.
There is no volume rendering, as there are no sampling points on the viewing ray. A pixel is a summation of visible 2D discs (referring to 3DGS). Only the alphas of the explicitly existent discs require computation, unlike volume rendering where every sampling location needs an alpha.
Gradient wrt Composite
(2024-01-07)
- The dotted lines in the left figure represent orthogonal correspondence, not projection.
- A pixel can only see 2D Gaussians located on its perpendicular line.
- The 2D Gaussians visible to a pixel are sorted based on depth, and then their colors are composited from near to far, multiplied by their opacities, which are computed as integrals along the viewing ray.
A synthesized pixel is a weighted sum of the related (overlapping) kernels’ color in the whole space:
$$C_{pred} = ∑_{n≤N} Tₙ⋅ αₙ⋅ cₙ, \quad \text{where } αₙ=oₙ⋅ e^{-½\bm Δₙᵀ \bm Σ’⁻¹ \bm Δₙ}$$
Loss:
$$L = โC_{targ} - C_{pred}โโ$$
The paper uses the Frobenius inner product in its analysis.
A Frobenius inner product is like a linear layer:
$$โจ๐,๐โฉ = \operatorname{vec}(๐)แต \operatorname{vec}(๐)$$
Chain rule:
$$\begin{aligned} \frac{∂f}{∂x} &= \frac{∂f}{∂X}⋅ \frac{∂X}{∂x}, \quad X = AY \\ &= \frac{∂f}{∂X}⋅ (\frac{∂A}{∂x}Y + A\frac{∂Y}{∂x})\\ \end{aligned}$$
Since the kernels already passed on the ray path (starting from a pixel) influence the next kernel’s contribution, which is scaled by the previously accumulated transmittance, the derivatives should be solved starting from the rearmost kernel, and then sequentially for the kernels in front, in the reverse order of the forward pass.
- The viewing ray travels from back to front and hits the screen. But the color nearest to the screen is the first to be seen by (or shown on) the camera (or eye).
- In the color stack, the color above depends on the colors below, so the topmost color is a function of all the preceding colors. Thus, the derivatives at the bottom should be solved first. (2024-01-16) The topmost color is the base of all downstream colors, so its derivative receives contributions from all the colors behind it.
Color, Opacity
(2024-01-08)
Given $\frac{∂L}{∂Cᵢ(k)}$, the partial derivatives of the predicted pixel color $Cᵢ$ w.r.t. each parameter of a Gaussian Gₙ (in the ray space) that contributes to the pixel are:
- The partial derivative of Cᵢ w.r.t. the kernel Gₙ’s color cₙ, based on the forward pass $Cᵢ = T₁⋅ α₁⋅ c₁+ T₂⋅ α₂⋅ c₂ + ⋯ Tₙ⋅ αₙ⋅ cₙ+ Tₙ₊₁⋅ αₙ₊₁⋅ cₙ₊₁ +⋯ + T_N⋅ α_N⋅ c_N$:
$$\frac{โCแตข(k)}{โcโ(k)} = ฮฑโโ Tโ$$
- k represents one channel of RGB.
- The furthest $T_N$ from the screen is saved at the end of the forward pass. Then the $T_{N-1}$ in front of it is calculated as $T_{N-1} = \frac{T_N}{1-α_{N-1}}$. The points in front follow this relation.

```cpp
T = T / (1.f - alpha);
const float dchannel_dcolor = alpha * T;
```
- Alpha αₙ
To solve the partial derivative of the pixel color $C_i$ w.r.t. the kernel Gₙ’s αₙ, only consider the kernels that follow Gₙ, as the transmittances of all the subsequent kernels rely on the current kernel Gₙ: $$Tₙ₊₁ = (1-αₙ)Tₙ$$
Thereby, the kernels behind contribute derivatives to the current kernel Gₙ’s alpha.
For example, the color of the next kernel, Gโโโ, behind Gโ is:
$$\begin{aligned} Cโโโ &= cโโโโ ฮฑโโโโ Tโโโ \\ &= cโโโโ ฮฑโโโโ (1-ฮฑโ)Tโ \\ &= cโโโโ ฮฑโโโโ (1-ฮฑโ)โ \frac{ Tโโโ}{1-ฮฑโ} \\ \end{aligned}$$
Thus, the color $Cโโโ$ contributes to the total partial derivative $\frac{โCแตข}{โฮฑโ}$ with the amount: $-\frac{cโโโโ ฮฑโโโโ Tโโโ}{1-ฮฑโ}$ .
Continuing, the next color Cₙ₊₂ can also be represented with αₙ:
$$ Cโโโ = cโโโโ ฮฑโโโโ Tโโโ \\ = cโโโโ ฮฑโโโ โ \cancel{(1-ฮฑโโโ)} (1-ฮฑโ) \frac{Tโโโ}{ \cancel{(1-ฮฑโโโ)} (1-ฮฑโ)}$$
- Thus, $\frac{โCโโโ}{โฮฑโ} = -\frac{cโโโโ ฮฑโโโโ Tโโโ}{1-ฮฑโ}$
Similarly, every subsequent kernel Gₘ, with m>n, also contributes to the overall partial derivative $\frac{∂Cᵢ}{∂αₙ}$.
Thereby, the ultimate partial derivative of the pixel color $Cᵢ$ w.r.t. Gₙ’s alpha αₙ is:
$$\frac{โCแตข}{โฮฑโ} = cโโ Tโ - \frac{โ_{m>n}cโโ ฮฑโโ Tโ}{1-ฮฑโ}$$
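This formula can be checked numerically with a scalar-color toy example (all values random; the kernels are assumed sorted front to back):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
c = rng.random(N)                 # per-kernel scalar colors
a = rng.uniform(0.1, 0.6, N)      # alphas, sorted front to back

def C(alpha):
    T, out = 1.0, 0.0
    for cn, an in zip(c, alpha):  # front-to-back compositing
        out += T * an * cn
        T *= (1.0 - an)
    return out

n = 2                             # check the gradient w.r.t. the 3rd kernel
T_n = np.prod(1.0 - a[:n])        # T_m = prod_{k<m} (1 - a_k)
analytic = c[n]*T_n - sum(c[m]*a[m]*np.prod(1.0 - a[:m])
                          for m in range(n+1, N)) / (1.0 - a[n])
d = np.zeros(N); d[n] = 1e-7
numeric = (C(a + d) - C(a - d)) / 2e-7   # central finite difference
assert np.isclose(analytic, numeric, atol=1e-5)
```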
- Opacity oₙ, mean 𝛍’, and covariance 𝚺':
According to $αₙ = oₙ e^{-σₙ}$, where $σₙ = ½\bm Δₙᵀ \bm Σₙ’⁻¹ \bm Δₙ$ (a scalar), and 𝚫ₙ is the offset from the pixel center to the 2D Gaussian Gₙ’s mean $\bm μ’$, i.e., $\bm Δₙ = \bm μₙ’ - 𝐱ᵢ$.
- Partial derivative of αₙ w.r.t. opacity oₙ:
$$ \frac{∂αₙ}{∂oₙ} = e^{-½\bm Δₙᵀ \bm Σₙ’⁻¹ \bm Δₙ} $$
- Partial derivative of αₙ w.r.t. the “exponent” sigma $σₙ$:
$$ \frac{โฮฑโ}{โฯโ} = -oโ e^{-ฯโ}$$
- Partial derivative of sigma σₙ w.r.t. the 2D mean 𝛍ₙ':
Because $\bm Δₙ$ is a function of 𝛍ₙ’, computing derivatives w.r.t. 𝛍ₙ’ is equivalent to computing them w.r.t. $\bm Δₙ$.
The Jacobian of σₙ is:
$$\frac{∂σₙ}{∂\bm μₙ’} = \frac{∂σₙ}{∂\bm Δₙ} = \frac{∂(½\bm Δₙᵀ \bm Σ’⁻¹ \bm Δₙ)}{∂\bm Δₙ} = \bm Σₙ’⁻¹ \bm Δₙ$$
- Partial derivative of sigma σₙ w.r.t. the 2D covariance matrix 𝚺':
$$\frac{∂σₙ}{∂\bm Σₙ’} = -½\, \bm Σₙ’⁻¹ \bm Δₙ \bm Δₙᵀ \bm Σₙ’⁻¹$$
Gradient wrt Projection
(2024-01-10)
A 3D Gaussian distribution centered at 𝛍ₖ with a covariance matrix 𝚺ₖ in the object space is projected onto the 2D screen through the splatting step, resulting in a 2D Gaussian distribution, applying the affine-transformation approximation for the nonlinear perspective division.
The 3D ray space (or the screen) is constructed based on the perspective division (x,y divided by z), which however is non-linear. Therefore, the projective transformation $φ(𝐭’)$ (i.e., perspective division + new depth) converting the clip space to the 3D ray space is approximated by a linear mapping: $ρₖ(𝐭’) = φ(𝐭ₖ’) + 𝐉ₖ⋅ (𝐭’ - 𝐭ₖ’)$, where $𝐭ₖ’ = 𝐏𝛍ₖᶜ$ is the mean vector in clip space (𝛍ₖᶜ being the camera-space mean).
The effects of the approximated affine mapping ฯโ(๐ญ’) are as follows:
-
The transformed 2D Gaussian’s center ๐โ’ is the exact projective transformation ฯ(๐โแถ), i.e., ฯโ(๐ญโ’)=ฯ(๐ญโ’), without any error, with the 3rd dimension omitted.
$$ \underset{(3D)}{\bm ฮผโ’} = ฯ(\bm ฮผโแถ) = \begin{bmatrix} f_xโ ฮผ_{k,x}แถ/ฮผ_{k,z}แถ + c_x \\ f_yโ ฮผ_{k,y}แถ/ฮผ_{k,z}แถ + c_y \\ \sqrt{ {ฮผ_{k,x}แถ}^2 + {ฮผ_{k,y}แถ}^2 + {ฮผ_{k,z}แถ}^2} \end{bmatrix} \overset{\text{Omit 3rd dim}}{\longrightarrow} \underset{(2D)}{\bm ฮผโ’}= \begin{bmatrix} \frac{f_xโ ฮผ_{k,x}แถ}{ฮผ_{k,z}แถ} + c_x \\ \frac{f_yโ ฮผ_{k,y}แถ}{ฮผ_{k,z}แถ} + c_y \end{bmatrix} $$
- However, the approximated transformation ρₖ(𝐭’) of an arbitrary point 𝐭’ around the clip-space 3D Gaussian center 𝐭ₖ' will gradually deviate from the precise perspective projection φ(𝐭’) as (𝐭’ - 𝐭ₖ’) increases in the approximated mapping:
$$ \begin{aligned} ฯ(๐ญ’) โ ฯโ(๐ญ’) &= ฯ(๐ญโ’) + ๐โโ (๐ญ’ - ๐ญโ’) \\ &= \bm ฮผโ’ + ๐โโ (๐ญ’ - ๐ญโ’) \end{aligned} $$
- The projected 2×2 covariance matrix on the screen is the 3×3 matrix in the ray space, $\bm Σₖ’ = 𝐉ₖ⋅ 𝐑_{w2c}⋅ \bm Σₖ⋅ 𝐑_{w2c}ᵀ⋅ 𝐉ₖᵀ$, with the 3rd row and column omitted.
(2024-01-11)
โญNote: The following $๐ญโ$ is the coordinates of a Gaussian center in the camera space: $$๐ญโ = [^{๐_{w2c} \ ๐ญ_{w2c}}_{0 \quad\ 1}] โ [^{\bm ฮผโ}_1]$$
- where 𝛍ₖ³ˣ¹ is the coordinates of the mean vector in world space.
- $𝐭’$ is the clip coordinates, which is the camera coordinates times the projection matrix 𝐏: $𝐭’ = 𝐏𝐭$
- 𝐏 maps the camera-space coordinates to the camera film (homogeneous) and scales them so that the ND coordinates of points located within the camera frustum range in [-1,1]. Using clip coordinates, points whose w (i.e., z) is smaller than x, y, z are deleted.
- The approximated projective transformation $ρₖ(𝐭’)$ fulfills perspective division after frustum clipping. Therefore, the 2D screen coordinates $\bm μₖ’_{(2D)} = ρₖ(𝐭ₖ’)_{(2D)}$ are NDC ∈ [-1,1].
- Then the Gaussian center’s ND coordinates are scaled back to the screen size, yielding pixel coordinates $\bm μₖ’_{(\text{pix})}$ represented with clip coordinates as:
$$ \bm ฮผโ’_{(\text{pix})} = \begin{bmatrix} (Wโ t_{k,x}’/t_{k,w}’ + 1) /2 + c_x \\ (Hโ t_{k,y}’/t_{k,w}’ + 1) / 2 + c_y \end{bmatrix} $$
This relationship enables propagating gradients from pixel coordinates $\bm ฮผโ’โโแตขโโ$ to the clip coordinates $๐ญโ’$ directly, without the 2D screen coordinates $\bm ฮผโ’_{(2D)}$ involved.
Center
(2024-01-12)
Because $\bm μₖ’_{(\text{pix})}$ and 𝚺ₖ’ on the 2D screen are both functions of the 3D Gaussian center 𝐭ₖ, the partial derivative of the loss w.r.t. 𝐭ₖ³ˣ¹ is a sum:
$$ \frac{โL}{โ๐ญโ} = \frac{โL}{โ \bm ฮผโ’โโแตขโโ} \frac{โ\bm ฮผโ’โโแตขโโ}{โ๐ญโ} + \frac{โL}{โ\bm ฮฃโ’_{(2D)}} \frac{โ\bm ฮฃโ’_{(2D)}}{โ๐ญโ} \\ $$
- The partial derivative of the 2D Gaussian’s mean $\bm μₖ’_{(\text{pix})}$ w.r.t. the camera coordinates of the 3D Gaussian center 𝐭ₖ:
$$ \begin{aligned} \frac{โ\bm ฮผโ’โโแตขโโ}{โ๐ญโ} &= \frac{โ\bm ฮผโ’โโแตขโโ}{โ\bm ฮผโ’_{(2D)}} โ \frac{โฯโ(๐ญโ’)_{(2D)}}{๐ญโ’} โ \frac{โ๐ญโ’}{โ๐ญโ} \qquad \text{(Full process)} \\ &= \frac{โ\bm ฮผโ’โโแตขโโ}{โ๐ญโ’} โ \frac{โ๐ญโ’}{โ๐ญโ} \qquad \text{(ClipโPix, skip screen coords)} \\ &= \frac{1}{2} \begin{bmatrix} W/t_{k,w}’ & 0 & 0 & -Wโ t_{k,x}’/{t_{k,w}’}^2 \\ 0 & H/t_{k,w}’ & 0 & -Hโ t_{k,y}’/{t_{k,w}’}^2 \end{bmatrix}โ ๐ \end{aligned} $$
Based on the properties of the Frobenius inner product, eq. (23) is obtained.
- The partial derivative of the 2D Gaussian’s covariance 𝚺ₖ’ w.r.t. the camera coordinates of the 3D Gaussian center 𝐭ₖ:
$$ \frac{โ\bm ฮฃโ’}{โ๐ญโ} = \frac{โ(๐โโ ๐_{w2c}โ \bm ฮฃโโ ๐_{w2c}แตโ ๐โแต)_{2D} }{โ๐ญโ} $$
(2024-01-13) The derivation follows “Mathematical derivations in 3D Gaussian Splatting” (3D Gaussian Splatting中的数学推导) - an article by 八氨合氯化钙 on Zhihu
- Letting $𝐔 = 𝐉ₖ⋅ 𝐑_{w2c}$ (the 3DGS code refers to it as `T`), the Gaussian covariance 𝚺ₖ’ in the 3D ray space derived from the projective transformation ρₖ(𝐭’) is:
$$\bm Σₖ’ = 𝐔 ⋅ \bm Σₖ⋅ 𝐔ᵀ, \qquad (\bm Σₖ’)_{ij} = ∑_{l=0}^{2} ∑_{m=0}^{2} U_{il}\, σ_{lm}\, U_{jm}$$
The 3rd row and column in ๐บโ’ are omitted due to the orthogonal correspondence between the 3D ray space and the 2D screen. Thus, $\bm ฮฃโ’_{(2D)}$ is only the upper-left 2ร2 elements of the 3D ๐บโ’, contributing to the gradient of 2D loss $L(๐โ, oโ, \bm ฮผโ’_{(2D)}, \bm ฮฃโ’_{(2D)})$, while the remaining 5 elements of ๐บโ’โโโโโ make no contributions.
$$ \bm ฮฃโ’_{(2D)} = \\ \Big[ \begin{array}{c|c} Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) & Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) \\ Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) & Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) + Uโโ(ฯโโUโโ+ฯโโUโโ+ฯโโUโโ) \end{array} \Big] $$
- Each element of $\bm Σₖ’_{(2D)}$ is a “sub-”function, whose derivative is taken w.r.t. each variable σ₀₀, σ₀₁, σ₀₂, σ₁₁, σ₁₂, σ₂₂, to backpropagate the gradient $\frac{∂L}{∂\bm Σₖ’_{(2D)}}$ to 𝚺ₖ. (Only these 6 elements of 𝚺ₖ need computation, as 𝚺ₖ₍₃D₎ is symmetric.)
- It’s not proper to think of the derivative of a “matrix” w.r.t. a matrix. Instead, it’s better to consider the derivative of a function w.r.t. variables, as essentially a matrix stands for a linear transformation.
The partial derivative of $\bm ฮฃโ’_{(2D)}$ w.r.t. $\bm ฮฃโ$ (the 3D covariance matrix in world space):
$$ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} \\ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} \\ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} \\ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} \\ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} \\ \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} = \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} = \begin{bmatrix} Uโโ \\ Uโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \end{bmatrix} $$
- ฯโโ, ฯโโ, ฯโโ are on the diagonal, while ฯโโ, ฯโโ, ฯโโ are off-diagonal.
The partial derivative of the loss L w.r.t. each element of ๐บโ:
$$ \begin{aligned} \frac{โL}{โ\bm ฮฃโ’_{(2D)}} \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} &= โ_{row}โ_{col}{ \begin{bmatrix} \frac{โL}{โa} & \frac{โL}{โb} \\ \frac{โL}{โb} & \frac{โL}{โc} \end{bmatrix} โ \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} } \\ &= \frac{โL}{โa} UโโUโโ + 2ร \frac{โL}{โb} UโโUโโ + \frac{โL}{โc} UโโUโโ \\ \frac{โL}{โ\bm ฮฃโ’_{(2D)}} \frac{โ\bm ฮฃโ’_{(2D)}}{โฯโโ} &= โ_{row}โ_{col}{ \begin{bmatrix} \frac{โL}{โa} & \frac{โL}{โb} \\ \frac{โL}{โb} & \frac{โL}{โc} \end{bmatrix} โ \begin{bmatrix} UโโUโโ & UโโUโโ \\ UโโUโโ & UโโUโโ \end{bmatrix} } \\ &= \frac{โL}{โa} UโโUโโ + \frac{โL}{โb} UโโUโโ +\frac{โL}{โb}UโโUโโ + \frac{โL}{โc} UโโUโโ \end{aligned} $$
- ⊙ is the Hadamard product (element-wise product). $∑_{row}∑_{col}$ means summation of all elements in the matrix.
(2024-02-17) In this step, the derivative w.r.t. a matrix is determined by calculating the derivative w.r.t. each element individually, rather than the entire matrix. Thus, the multiplication between the two “derivative matrices” is a Hadamard product, as essentially it’s the derivative w.r.t. a single scalar (in contrast to a vector or matrix). However, if, for example, $\frac{∂\bm Σ}{∂𝐔}$ is the derivative of 𝚺 w.r.t. the matrix 𝐔, the multiplication with the incoming upstream “derivative matrix” should be a normal matmul.
Within a chain of differentiation, these 2 manners of solving the derivative for a matrix, computing the derivative for the entire matrix or calculating it for each element, can coexist.
- Note: The $\frac{∂L}{∂b}$ in the 3DGS code has been doubled. And each off-diagonal element is multiplied by 2, as the symmetric element has the same gradient contribution.
In 3DGS code, the derivative of loss w.r.t. each element of 3D covariance matrix ๐บโโโโโโ in the world space is computed individually:
```cpp
dL_da = denom2inv * (...);
dL_dc = denom2inv * (...);
dL_db = denom2inv * 2 * (...);

dL_dcov[0] = (T[0][0]*T[0][0]*dL_da + T[0][0]*T[1][0]*dL_db + T[1][0]*T[1][0]*dL_dc);
dL_dcov[3] = (T[0][1]*T[0][1]*dL_da + T[0][1]*T[1][1]*dL_db + T[1][1]*T[1][1]*dL_dc);
dL_dcov[5] = (T[0][2]*T[0][2]*dL_da + T[0][2]*T[1][2]*dL_db + T[1][2]*T[1][2]*dL_dc);
dL_dcov[1] = 2*T[0][0]*T[0][1]*dL_da + (T[0][0]*T[1][1] + T[0][1]*T[1][0])*dL_db + 2*T[1][0]*T[1][1]*dL_dc;
dL_dcov[2] = 2*T[0][0]*T[0][2]*dL_da + (T[0][0]*T[1][2] + T[0][2]*T[1][0])*dL_db + 2*T[1][0]*T[1][2]*dL_dc;
dL_dcov[4] = 2*T[0][2]*T[0][1]*dL_da + (T[0][1]*T[1][2] + T[0][2]*T[1][1])*dL_db + 2*T[1][1]*T[1][2]*dL_dc;
```

The 2D covariance matrix $\bm Σₖ’_{(2D)}$ is not equivalent to the calculation where the 3rd row and column of 𝚺ₖ are omitted from the beginning, because σ₀₂, σ₁₂, σ₂₂ are also involved in the projected covariance 𝚺ₖ’. However, the derivatives w.r.t. them ($\frac{∂\bm Σₖ’_{(2D)}}{∂σ₀₂},\ \frac{∂\bm Σₖ’_{(2D)}}{∂σ₁₂},\ \frac{∂\bm Σₖ’_{(2D)}}{∂σ₂₂}$) can’t be derived from the following expression:
$$\begin{aligned} \bm ฮฃโ’_{(2D)} &= \begin{bmatrix} Uโโ & Uโโ \\ Uโโ & Uโโ \end{bmatrix} \begin{bmatrix} ฯโโ & ฯโโ \\ ฯโโ & ฯโโ \end{bmatrix} \begin{bmatrix} Uโโ & Uโโ \\ Uโโ& Uโโ \end{bmatrix} \\ &= \begin{bmatrix} (Uโโฯโโ + Uโโฯโโ)Uโโ + (Uโโฯโโ + Uโโฯโโ) Uโโ & (Uโโฯโโ + Uโโฯโโ)Uโโ + (Uโโฯโโ + Uโโฯโโ) Uโโ \\ (Uโโฯโโ + Uโโฯโโ)Uโโ + (Uโโฯโโ + Uโโฯโโ) Uโโ & (Uโโฯโโ + Uโโฯโโ)Uโโ + (Uโโฯโโ + Uโโฯโโ) Uโโ \end{bmatrix} \end{aligned} $$
- The partial derivative of the 2D Gaussian covariance matrix $\bm Σₖ’_{(2D)}$ w.r.t. the 3D Gaussian center 𝐭ₖ in the camera space:
$$ \begin{aligned} \frac{โ\bm ฮฃโ’_{(2D)}}{โ๐โ} \frac{โ๐โ}{โ๐ญโ} = \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โ๐โ} \frac{โ(๐โโ ๐_{w2c})}{โ๐ญโ} \\ \end{aligned} $$
- $\frac{∂𝐔 {\bm Σₖ}_{(3D)} 𝐔ᵀ}{∂𝐔}$ (corresponding to $\frac{∂\bm Σₖ’}{∂T}$ inside $\frac{∂L}{∂T}$ of eq. (25) in the gsplat paper):
$$ \begin{aligned} \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 2ฯโโUโโ+(ฯโโ+ฯโโ)Uโโ+(ฯโโ+ฯโโ)Uโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ ฯโโUโโ + ฯโโUโโ + ฯโโUโโ & 0 & 0 \\ ฯโโUโโ + ฯโโUโโ + ฯโโUโโ & 0 & 0 \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} Uโโฯโโ+ฯโโUโโ+2ฯโโUโโ+ฯโโUโโ +Uโโฯโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ Uโโฯโโ + Uโโฯโโ+Uโโฯโโ & 0 & 0 \\ Uโโฯโโ + Uโโฯโโ+Uโโฯโโ & 0 & 0 \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} Uโโฯโโ+Uโโฯโโ+ฯโโUโโ+ฯโโUโโ+2ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 & 0 \\ Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 & 0 \end{bmatrix} \\ \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & 2ฯโโUโโ+ฯโโUโโ+ฯโโUโโ+Uโโฯโโ+Uโโฯโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & Uโโฯโโ+ฯโโUโโ+2ฯโโUโโ+ฯโโUโโ+Uโโฯโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & Uโโฯโโ+Uโโฯโโ+ฯโโUโโ+ฯโโUโโ+2ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ \\ 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ & 0 \end{bmatrix} \\ \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & 2ฯโโUโโ+ฯโโUโโ+ฯโโUโโ+ Uโโฯโโ + Uโโฯโโ \\ \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & Uโโฯโโ+ ฯโโUโโ+2ฯโโUโโ+ฯโโUโโ+ Uโโฯโโ \\ \end{bmatrix} \\ \frac{โ๐โ {\bm ฮฃโ}_{(3D)} ๐โแต}{โUโโ} &= \begin{bmatrix} 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ 0 & 0 & Uโโฯโโ+Uโโฯโโ+Uโโฯโโ \\ ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & ฯโโUโโ+ฯโโUโโ+ฯโโUโโ & Uโโฯโโ + Uโโฯโโ + ฯโโUโโ+ฯโโUโโ+2ฯโโUโโ \\ \end{bmatrix} \end{aligned} $$
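These element-wise matrices all follow the product rule $\frac{∂(𝐔𝚺𝐔ᵀ)}{∂U_{ij}} = 𝐄_{ij}𝚺𝐔ᵀ + 𝐔𝚺𝐄_{ij}ᵀ$, where $𝐄_{ij}$ is the single-entry matrix. A small NumPy check with random matrices (a sketch, not taken from the 3DGS code):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.random((3, 3))
A = rng.random((3, 3))
Sigma = A @ A.T                      # a random symmetric 3x3 "covariance"

i, j = 0, 1                          # check the derivative w.r.t. U01
E = np.zeros((3, 3)); E[i, j] = 1.0  # single-entry matrix E_ij
analytic = E @ Sigma @ U.T + U @ Sigma @ E.T   # product rule for U Sigma U^T

eps = 1e-6
Up, Um = U.copy(), U.copy()
Up[i, j] += eps; Um[i, j] -= eps
numeric = (Up @ Sigma @ Up.T - Um @ Sigma @ Um.T) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-5)
```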
- $\frac{∂𝐔ₖ}{∂𝐭ₖ} = \frac{∂(𝐉ₖ⋅ 𝐑_{w2c})}{∂𝐭ₖ} = \frac{∂𝐉ₖ}{∂𝐭ₖ}⋅ 𝐑_{w2c} + \cancel{ 𝐉ₖ⋅ \frac{∂𝐑_{w2c}}{∂𝐭ₖ} }$
(2024-01-16)
The derivative of 𝐔ₖ w.r.t. the camera-space center 𝐭ₖ can be obtained from the representation of 𝐉ₖ in terms of 𝐭ₖ, which, unlike the representation with clip coordinates 𝐭ₖ’, exposes the focal lengths fx, fy. In this way, the projection matrix P isn’t involved, since 𝐭ₖ’=𝐏𝐭ₖ.
$$ ๐_{๐ญโ} = \begin{bmatrix} fโ/t_{k,z} & 0 & -fโ t_{k,x} / {t_{k,z}}^2 \\ 0 & f_y/t_{k,z} & -f_y t_{k,y} / {t_{k,z}}^2 \\ t_{k,x}/โ๐ญโโ & t_{k,y}/โ๐ญโโ & t_{k,z}/โ๐ญโโ \end{bmatrix} $$
The derivative of $๐_{๐ญโ}$ w.r.t. each component of ๐ญโ:
$$ \begin{aligned} \frac{โ๐_{๐ญโ}}{โt_{k,x}} = \begin{bmatrix} 0 & 0 & -f_x/t_{k,z}ยฒ \\ 0 & 0 & 0 \\ 1/โ๐ญโโ - t_{k,x}^2/โ๐ญโโ^3 & -t_{k,y}t_{k,x}/โ๐ญโโ^3 & -t_{k,z}t_{k,x}/โ๐ญโโ^3 \end{bmatrix} \\ \frac{โ๐_{๐ญโ}}{โt_{k,y}} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -f_y/t_{k,z}^2 \\ -t_{k,x}t_{k,y}/โ๐ญโโ^3 & 1/โ๐ญโโ - t_{k,y}^2/โ๐ญโโ^3 & -t_{k,z}t_{k,y}/โ๐ญโโ^3 \end{bmatrix} \\ \frac{โ๐_{๐ญโ}}{โt_{k,z}} = \begin{bmatrix} -f_x/t_{k,z}^2 & 0 & 2 f_x t_{k,x}/t_{k,z}^3 \\ 0 & -f_y/t_{k,z}^2 & 2 f_y t_{k,y}/t_{k,z}^3 \\ -t_{k,x}t_{k,z}/โ๐ญโโ^3 & -t_{k,y}t_{k,z}/โ๐ญโโ^3 & 1/โ๐ญโโ - t_{k,z}^2/โ๐ญโโ^3 \end{bmatrix} \end{aligned} $$
- The derivative of 𝚺ₖ’ w.r.t. 𝐭ₖ:
(2024-01-17)
$$ \begin{aligned} \frac{โ\bm ฮฃโ’_{(2D)}}{โ๐โ} \frac{โ๐โ}{โt_{k,x}} \end{aligned} $$
- Based on the projective transformation $φ(𝐭) ≈ φ(𝐭ₖ) + 𝐉ₖ⋅ (𝐭 - 𝐭ₖ)$,
where
- The extrinsics of the camera: $𝐖_{w2c} = [^{𝐑_{w2c} \ 𝐭_{w2c}}_{0 \quad\ 1}]$
- 𝐭 is the mean vector represented in the camera space: $𝐭 = 𝐖_{w2c} ⋅ [^{\bm μₖ}_1]$
- The Jacobian of the projective transformation evaluated at 𝛍ₖ:
$$𝐉ₖ = \begin{bmatrix} fₓ/μ_{k,z} & 0 & -fₓ μ_{k,x} / μ_{k,z}^2 \\ 0 & f_y/μ_{k,z} & -f_y μ_{k,y} / μ_{k,z}^2 \\ μ_{k,x}/‖\bm μₖ‖₂ & μ_{k,y}/‖\bm μₖ‖₂ & μ_{k,z}/‖\bm μₖ‖₂ \end{bmatrix} $$
Therefore, the partial derivatives of the loss are:
- The partial derivative of the loss 𝐿 w.r.t. the 3D Gaussian center 𝐭 in the camera space:
$$ \frac{โL}{โ๐ญ} = \frac{โL}{โ\bm ฮผ’} \frac{โ\bm ฮผ’}{โ๐ญ} + \frac{โL}{โ\bm ฮฃโ’_{(2D)}} \frac{โ\bm ฮฃโ’_{2D}}{โ๐ญ} $$
- The partial derivative of the loss 𝐿 w.r.t. the 3D Gaussian covariance matrix 𝚺ₖ in the world space:
$$ \frac{โL}{โ\bm ฮฃโ} = \frac{โL}{โ\bm ฮฃโ’_{(2D)}} \frac{โ\bm ฮฃโ’_{(2D)}}{โ\bm ฮฃโ} $$
Covariance
Because a covariance matrix is symmetric (𝚺ₖ = 𝚺ₖᵀ), it is diagonalizable (by the spectral theorem for real symmetric matrices).
“Diagonalizable matrix ๐ can be represented as: ๐ = ๐๐๐โปยน.”
“A diagonalizable matrix ๐ may (?) be decomposed as ๐=๐๐ฒ๐แต”
“Quadratic form can be regarded as a generalization of conic sections.” Symmetric matrix
Since the covariance matrix ๐บ is a symmetric matrix, its eigenvalues are all real. By arranging all its eigenvectors and eigenvalues into matrices, there is:
$$\bm ฮฃ ๐ = ๐ ๐$$
- where each column in 𝐐 is an eigenvector, and the eigenvectors are orthogonal to each other.
- 𝐋 is a diagonal matrix. For example:
$$ \bm ฮฃ ๐ = ๐ ๐ = \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & j \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix} = \begin{bmatrix} a & 2d & 3g \\ b & 2e & 3h \\ c & 2f & 3j \end{bmatrix} $$
- If 𝐐 is invertible, 𝚺 can be represented as 𝚺=𝐐𝐋𝐐⁻¹
The eigenvector matrix 𝐐 and the eigenvalue matrix 𝐋 correspond to the rotation matrix 𝐑 and the squared stretching matrix 𝐒𝐒ᵀ, which are solved from SVD: 𝚺=𝐑𝐒𝐒ᵀ𝐑ᵀ.
The rotation matrix rotates the original space to a new basis, where each axis points in the direction of the highest variance, i.e., the eigenvectors. And the stretching matrix indicates the magnitude of variance along each axis, i.e., the square root of the eigenvalues $\sqrt{๐}$. janakiev-Blog
After obtaining the magnitude of the variance in each direction, the extent range on each axis can be calculated based on the standard deviation according to the 3-sigma rule.
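A small NumPy sketch tying this together: build 𝚺 = 𝐑𝐒𝐒ᵀ𝐑ᵀ from a made-up rotation and per-axis scales, then recover the ellipsoid's axes and extents via eigendecomposition:

```python
import numpy as np

# Build a 3D covariance from a rotation R and per-axis scales s (3DGS-style).
theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
s = np.array([2.0, 0.5, 1.0])               # stretching along each local axis
S = np.diag(s)
Sigma = R @ S @ S.T @ R.T                    # Sigma = R S S^T R^T

# Recover the ellipsoid from Sigma: eigenvectors give the axes,
# square roots of the eigenvalues give the extents (standard deviations).
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(np.sqrt(eigvals))                      # -> 0.5, 1.0, 2.0 (sorted)
```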
To optimize the 3D Gaussians in the world space based on the rendered image, the derivative chain is: