read: Image as Set of Points

Code | Arxiv(2303) | ProjPage

Briefs

Author’s Talk

Source Video: 【Talk | ICLR'23 Oral 美国东北大学马旭：图像亦是点集（Image as Set of Points）】

Insights:

Use clustering to extract image features.
Fusing cluster members through Collapse followed by Reconstruction

Paper Explain

Source Video: 【ICLR 2023】Image as Set of Points.计算机视觉新范式，利用聚类的思想实现图像建模。在多个下游任务上不输ViT和ConvNets】

Image features are refined through multiple aggregation and dispatching using attention for pixels.

Paper Notes

Abstract

Is the picture natively well-clustered?
- If so, this method essentially is same as convolution, which extracts features by fusing pixels in the kernel.
  
  But Cluster is discreate and sparse, can it capture high-level feature?
Clusters appeals to segmentation and interpretability.

Intro

Is this method designed for a single image?

However, MVS can directly leverage epipolar lines on source views more concisely, without fusing neighbor pixels features..
Pixels with same color can imply fundamentally difference?

“Similar pixels are grouped, but they are fundamentally different.” ?

I think same color means rays hit same spatial geometry.

Method

Pipeline:

Initial feature of raw pixels are 5-D, including color (r,g,b) ∈ [0,255] and normalized pixel center (x,y) ∈ [0, 1]-0.5
Pixels are reduced to 𝑟 times at each stage’s start by fusing k (=4 or 9) neighbors around the 𝑟 evenly distributed anchors across the image, i.e., concatenating their channels and projecting to specified dimension by FC, unlike maxpooling.
- Conv layer can perform points reduction as well if points’ organization aligned with the raw image.
- In their implementation, function PointReducer uses a conv layer instead of FC.
  
  For model coc_base_dim64, before stage-0, image is reduced to 1/16 with kernels of size=4, i.e., 16 points are fused to 1. and 5-D features are mixed to 64-D.
  1 2
  
  # Point Reducer is implemented by a layer of conv since it is mathmatically equal. nn.Conv2d(5, 64, kernel_size=4, stride=4, padding=0)
Output points require reording for downstream pixel-wise tasks, like segmentation.

CoC Block:

Each stage repeats Context Cluster block several time. A blocks processes points set in 3 steps:

Pixels clustered $c$ groups in the feature space
- c centers are evenly distributed and form voronoi based on the cosine similarity of each point to each center feature.
- Centers are chose through nn.AdaptiveAvgPool2d((proposal_w, proposal_h)), which will set kernel and stride automatically for AvgPool2d.
- Center feature is the average of k nearest neighbors, after placing the c centers.
- Intial features are 5-D including RGB and position (x,y).
Aggregation: Add features of all members to an aggregated feature g:
$$g = \frac{1}{C} (v_c + ∑_{i=1}^m sig(α s_i + β) * v_i )$$
The aggregated feature g is computed by plusing center’s feature and the weighted sum of feature vectors of all m points vᵢ in a cluster, scaled by a tunable factor $sig(α sᵢ + β)$ ∈ (0,1), where s is similarity to the center feature and α,β are nn.parameters.

The denominator $C = 1+ ∑_{i=1}^m sig(α sᵢ+ β)$ aims to limit the magnitude.
Dispatching: Each member updates its feature from the aggregated feature, so as to fuse all other points and realize spatial interaction.
$$p_i' = p_i + FC(sig (α sᵢ+ β) * g )$$
The amount of g assigned to a member is determined by the adaptive similarity again, inversing the summation.

img-zhihu

Communication between pixel in a cluster is like server-client in a centralized network.
Centers’s positions are fixed for efficiency, so it emphasizes locality.
doubt: Advanced postional embedding could be applied.
doubt: Will different selection strategies affect model performance? They mentioned Farthest Point Sampling (FPS) mehtod in appx.D

Architecture:

Context Cluster is a hierarchical model composed 4 stages and points are reduced to 1/4 (ie, h/2, w/2) after each stage.

img-arch

Play

Model can be comprehended by debugging the file “context_cluster”, using environment of “AIM”.

\begin{algorithm}
\begin{algorithmic}
\STATE PointReducer: Conv2d(x), downsample 16 times, 256 dim
\STATE Partition feature maps: rearrange(x)
\STATE Centers feature from x: AdaptiveAvgPool2d((2,2))(x)
\STATE Simlarity matrix: vectors' inner product with multi-head
\STATE Clustering: .scatter\_
\STATE Aggregate feature $g$: sum members' feat based on similarity
\STATE Dispatch $g$ to members
\STATE Reverse partition
\STATE Project to out\_dim: Conv2d
\STATE FFN: Mlp, out\_dim → hidden → out\_dim
\end{algorithmic}
\end{algorithm}

Similarity and points’ features are optimized separately:

Similarity: x → center → sim
Features: x → value → val_center → aggregated feature out

sim and out are decoupled.