Table of contents
(2023-10-15)
Talk-221121
Source video: 丁霄汉:结构重参数化是怎么来的【深度学习】【直播回放】-bilibili
-
RepVGG: (2021)
-
VGG-style models have fast inference because a single-branch stream runs efficiently in parallel, but their accuracy is poor.
Multiple branches mean multiple sets of parameters, which help achieve better accuracy but slow down inference.
-
If a set of parameters can be equivalently transformed into another set of parameters, the corresponding structure changes accordingly.
Therefore, the multi-branch architecture used during training can be transformed into a single-branch model for inference.
-
Methodology: The kernel size can be changed while the computation stays equivalent, e.g., a 1×1 kernel can be zero-padded into a 3×3 kernel.
- Thus, 3 branches with a 3×3 kernel, a 1×1 kernel, and a 3×3 identity kernel can be added into a single 3×3 kernel based on the linearity of convolution: $x * K_a + x * K_b = x * (K_a+K_b)$
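A minimal PyTorch sketch of this fusion (my own illustration, assuming equal input/output channels, stride 1 and no BN; names and shapes are not from the paper code):

```python
import torch
import torch.nn.functional as F

def fuse_repvgg_branches(k3x3, k1x1, channels):
    """Merge a 3x3 branch, a 1x1 branch and an identity branch into one 3x3 kernel."""
    # Zero-pad the 1x1 kernel to 3x3 so all branches share the same kernel size.
    k1x1_as_3x3 = F.pad(k1x1, [1, 1, 1, 1])
    # The identity branch equals convolution with a 3x3 kernel that has a 1 at the
    # center of the matching input/output channel.
    k_id = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        k_id[i, i, 1, 1] = 1.0
    # Linearity of convolution: x*Ka + x*Kb + x*Kc = x*(Ka + Kb + Kc)
    return k3x3 + k1x1_as_3x3 + k_id

C = 4
x = torch.randn(1, C, 8, 8)
k3, k1 = torch.randn(C, C, 3, 3), torch.randn(C, C, 1, 1)
y_branches = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1) + x
y_fused = F.conv2d(x, fuse_repvgg_branches(k3, k1, C), padding=1)
assert torch.allclose(y_branches, y_fused, atol=1e-5)
```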
-
-
Centripetal SGD: (2017)
- Inception: compressing (pretrained?) models by pruning redundant channels in feature maps.
- To create identical channels, let the optimizer (SGD) guide some channels to become similar.
- Two identical channels are merged into one comprehensive channel; the model becomes more compact while performance stays unchanged.
-
Linear Redundancy Unit (Obsolete)
-
Merge 2 feature maps: train with two 3×3 kernels and merge the two kernels after training. This method brought only marginal improvement, though.
-
This indicates that two models with the same final structure, but trained through different processes in different architectures, can have different performance.
-
-
Asymmetric Convolution Block (2019)
- The branches have different shapes: 3×3 + 1×3 + 3×1
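A minimal sketch of the ACB-style fusion (my own illustration, not the paper's code): pad the asymmetric kernels to 3×3 and add them, again by linearity of convolution.

```python
import torch
import torch.nn.functional as F

k3x3 = torch.randn(8, 8, 3, 3)
k1x3 = torch.randn(8, 8, 1, 3)
k3x1 = torch.randn(8, 8, 3, 1)
# Pad the 1x3 kernel vertically and the 3x1 kernel horizontally to 3x3, then sum.
fused = k3x3 + F.pad(k1x3, [0, 0, 1, 1]) + F.pad(k3x1, [1, 1, 0, 0])

x = torch.randn(1, 8, 16, 16)
y_branches = (F.conv2d(x, k3x3, padding=1)
              + F.conv2d(x, k1x3, padding=(0, 1))
              + F.conv2d(x, k3x1, padding=(1, 0)))
assert torch.allclose(y_branches, F.conv2d(x, fused, padding=1), atol=1e-4)
```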
-
Research on Simple Models: (2020)
-
Use an identity branch to eliminate some shortcuts in ResNet.
- How to make an ultimately simple yet powerful model without shortcuts? ▶ RepVGG
-
Multiple branches, like those in InceptionNet, packed into just a single kernel.
- Why can it work in an arbitrary model? ▶ Diverse Branch Block (DBB)
-
-
RepMLP: (2022)
- Inject locality into an MLP (a CNN is a special MLP) by transforming an arbitrary conv kernel into an FC kernel.
-
RepLKNet: (2022)
- Large kernel: 31x31 + 5x5
-
Misc:
-
ResRep for channel pruning.
- Output channels can be controlled through a 1×1 kernel appended after the original 3×3 kernel, so that channel pruning can be performed on the 1×1 kernel.
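A minimal sketch of why this works (my own illustration, not the official ResRep code): a 3×3 conv followed by a pointwise 1×1 conv collapses into a single 3×3 conv, so pruning rows of the 1×1 kernel prunes output channels of the fused conv.

```python
import torch
import torch.nn.functional as F

def fuse_3x3_then_1x1(k3, k1):
    # k3: (M, C, 3, 3), k1: (O, M, 1, 1)  ->  fused kernel: (O, C, 3, 3)
    return torch.einsum('om,mchw->ochw', k1.squeeze(-1).squeeze(-1), k3)

C, M, O = 3, 8, 5
x = torch.randn(1, C, 16, 16)
k3, k1 = torch.randn(M, C, 3, 3), torch.randn(O, M, 1, 1)
y_two_convs = F.conv2d(F.conv2d(x, k3, padding=1), k1)
y_fused = F.conv2d(x, fuse_3x3_then_1x1(k3, k1), padding=1)
assert torch.allclose(y_two_convs, y_fused, atol=1e-4)
```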
-
RepOptimizer: generalizes to gradient reparameterization for fast training.
- Incorporates the prior knowledge (inductive bias) into the optimizer instead of the model structure.
- RepVGGplus: principles behind RepVGG
-
Ideas moving forward:
-
Connect Structural Rep with every element in a general vision model:
Topology (RepVGG), Component (ACNet, DBB), Width (ResRep), Globality vs. locality (RepMLP), Kernel size (RepLKNet), Optimizer (RepOptimizer)
-
Rethink classical problems.
- Simple models, like VGG, don't work? (RepVGG)
- Can't train a super deep model without shortcuts? (RepVGGplus)
- Inception Net is too complex and has to be abandoned? (DBB)
- MLPs can't handle image tasks? (RepMLP)
- Large kernels are less effective? (RepLKNet)
Related works:
- Non-deep Network; RepNAS, YOLO v6&v7, DyRep, Scaling up Kernels in 3D GNNs, RepUNet, RepSR (superres), De-IReps.
Talk-220426
Source video: 【论文连讲:用重参数化赋予MLP网络局部性、超大卷积核架构【CVPR2022】【基础模型】】- bilibili
(2023-10-16)
RepMLPNet
“A hierarchical MLP network that introduces locality via reparameterization.”
An MLP has no locality, only global capacity, so it is not well suited to applying a linear projection directly to 2D images.
-
Locality means the pixels surrounding an input pixel should contribute more, because they are more strongly correlated with it than distant pixels are.
-
However, an MLP treats all pixels of an image equally, ignoring relative positions, so it is hard to train to convergence on image data due to the high dimensionality and the separate parameters learned for each pixel.
-
A CNN preserves this inductive bias through its kernels. However, a CNN lacks long-range dependencies because different regions share the same parameters (the kernel), so it has to stack many layers to obtain a large receptive field. In contrast, an MLP is a function of positions and is sensitive to location.
-
Hence, one approach to injecting locality is to create parallel branches with various conv kernels (of different shapes) alongside the FC layer.
-
-
By supplementing the FC layer with conv kernels, the model handles both long-range dependencies and locality for 2D images.
-
The side effect is that multiple heterogeneous branches hinder computational parallelism and ultimately impair inference efficiency.
-
The solution that maintains inference efficiency while preserving the conv branches is Structural Reparameterization:
merging the multiple auxiliary branches into a single FC stream is realized by transforming their parameters after training into one FC kernel, so that inference speed and accuracy are unchanged.
Realize an equivalent transformation of the structure through an equivalent transformation of the parameters.
-
-
Generic CNNs with conv kernels already contain massive numbers of parameters, and multiple branches of conv kernels may be infeasible without reducing the parameter count.
- There are three branches: FC3, 3x3 & 1x1 kernels, and “Identity”

-
The Identity branch performs FC1+FC2 after max pooling shrinks (H, W) down to (1, 1).
Thus, the FC layer needs only 1 parameter. Adding the 4 parameters of BatchNorm (mean, std, scale factor, bias), this branch has only 5 parameters.
This branch functions like an SE (Squeeze-and-Excitation) block, providing a channel-wise “overall scaling”.
-
The 3×3 and 1×1 conv layers perform “set-sharing” (depth-wise + group convolution), where the total of C channels is split into S sets.
The number of parameters in such a layer is then reduced from (C×H×W)² to S×(H×W)².
-
The main branch performs FC3 on the input feature maps after a depth-wise convolution.
-
The FC layer equivalent to a conv layer is required in order to add the conv branches into the FC layer.
-
An FC kernel is the 2D weight matrix $W_{d_o\times d_i}$ of a linear layer.
-
A 3D conv kernel is a special FC kernel: the convolution can be written as multiplication by a Toeplitz matrix with many shared parameters, so its associated FC kernel must exist.
Then the two FC kernels can be added up directly based on linearity.
-
An FC layer processes a feature map through 4 steps: (n, c, h, w) ➔ (n, c×h×w) ➔ FC kernel ➔ (n, o×h×w) ➔ (n, o, h, w), denoted as $\rm MMUL(featmap, W_{d_o\times d_i})$, where $d_i = c{\times}h{\times}w$ and $d_o = o{\times}h{\times}w$.
A conv layer with a 3D conv kernel $F$ and padding $p$ processing the feature map is denoted as $\rm CONV(featmap, F, p)$.
Thus, the problem is how to convert the 3D conv kernel into a 2D FC kernel.
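A tiny sketch of the MMUL notation above (my own illustration, assuming $d_i = c{\times}h{\times}w$ and $d_o = o{\times}h{\times}w$):

```python
import torch

def mmul(featmap, W, o):
    """Flatten (n,c,h,w) -> (n, d_i), multiply by the (d_o x d_i) FC kernel W, unflatten."""
    n, c, h, w = featmap.shape
    out = featmap.reshape(n, c * h * w) @ W.T   # (n, d_i) -> (n, d_o)
    return out.reshape(n, o, h, w)
```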
-
Given $W^{(F,p)}$, the FC kernel corresponding to a conv kernel $F$ with padding $p$, the two operations are equivalent: $\rm MMUL(featmap, W^{(F,p)}) = CONV(featmap, F, p)$
-
Consider a linear layer; it projects vectors: $V_{n\times d_o} = V_{n\times d_i} \cdot W^{(F,p)\,T}$
Insert an identity matrix $I$:
$$ V_{n\times d_o} = V_{n\times d_i} \cdot I \cdot W^{(F,p)\,T} = V_{n\times d_i} \cdot \big(I_{d_i\times d_i} \cdot W^{(F,p)\,T}\big) $$
Then the term $\big(I_{d_i\times d_i}\cdot W^{(F,p)\,T}\big)$ can be regarded as a convolution operation.
-
A conv operation must be a Mat-Mul, but a Mat-Mul may not be a conv operation.
What kind of Mat-Mul (FC layer) is a conv operation? One whose weight matrix is a Toeplitz matrix built from a conv kernel.
Because $W^{(F,p)}$ is indeed built from a conv kernel, the Mat-Mul $I\cdot W^{(F,p)\,T}$ is certainly a convolution:
$$ I_{d_i\times d_i}\cdot W^{(F,p)\,T} \;\Leftrightarrow\; {\rm CONV}(featmap, F, p) $$
In this convolution, $I_{d_i\times d_i}$ is what gets convolved, so it plays the role of the featmap in CONV(), i.e., $I_{d_i\times d_i}$ is reshaped into a feature map of shape $(c{\times}h{\times}w,\ c,\ h,\ w)$.
Additional reshaping is needed to match the dimensionality:
$$ I_{d_i\times d_i}\cdot W^{(F,p)\,T} = {\rm CONV}\big(I.{\rm reshape}(chw,\,c,\,h,\,w),\ F,\ p\big).{\rm reshape}(chw,\ o{\times}h{\times}w) $$
-
From the above equation, and since $I\cdot W^{(F,p)\,T} = W^{(F,p)\,T}$, the desired FC kernel is simply the result of convolving the kernel $F$ over the identity matrix viewed as a feature map:
$$ W^{(F,p)\,T} = {\rm CONV}\big(I_{(c\times h\times w,\ c,\ h,\ w)},\ F,\ p\big) $$
For example, if the conv kernel $F$ has $o$ output channels, $c$ input channels and size 3×3, this convolution output has shape $(c{\times}h{\times}w,\ o,\ h-3+2p+1,\ w-3+2p+1) = (c{\times}h{\times}w,\ o,\ h,\ w)$ when $p = 1$.
This “3D FC kernel” has already finished the “sum” part of the computation and is waiting to be Mat-Mul'ed with the input feature maps.
To align with the flattened 2D input feature maps $(n,\ c{\times}h{\times}w)$, it needs to be reshaped to 2D: $(c{\times}h{\times}w,\ o{\times}h{\times}w)$.
Finally, a 3D conv kernel becomes a 2D kernel.
The equivalent FC kernel of a conv kernel is the result of convolution on an identity matrix with proper reshaping.
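A minimal PyTorch sketch of this construction (my own illustration; shapes are assumptions, not the paper's code): build the FC kernel by convolving $F$ over the identity matrix reshaped into c×h×w one-hot feature maps, then flatten and check $\rm MMUL(featmap, W^{(F,p)}) = CONV(featmap, F, p)$.

```python
import torch
import torch.nn.functional as F

def conv_to_fc_kernel(conv_kernel, c, h, w, p):
    """Return the FC kernel W^(F,p) of shape (d_o, d_i) = (o*h*w, c*h*w)."""
    # The identity I_(d_i x d_i) viewed as chw one-hot feature maps of shape (c, h, w).
    eye = torch.eye(c * h * w).reshape(c * h * w, c, h, w)
    out = F.conv2d(eye, conv_kernel, padding=p)     # (chw, o, h, w) if padding keeps the size
    return out.reshape(c * h * w, -1).T             # transpose to the (d_o, d_i) layout

c, o, h, w, p = 2, 3, 5, 5, 1
x = torch.randn(4, c, h, w)
kernel = torch.randn(o, c, 3, 3)

W = conv_to_fc_kernel(kernel, c, h, w, p)                     # (o*h*w, c*h*w)
y_conv = F.conv2d(x, kernel, padding=p)                       # CONV(featmap, F, p)
y_fc = (x.reshape(4, -1) @ W.T).reshape(4, o, h, w)           # MMUL(featmap, W)
assert torch.allclose(y_conv, y_fc, atol=1e-5)
```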
-
Fuse the parameters (μ, σ, γ, β) of BatchNorm into the convolution layer based on linearity.
$$ M' = \gamma\cdot\frac{M * F - \mu}{\sigma} + \beta = \frac{\gamma}{\sigma}\,(M * F) + \Big(\beta - \frac{\gamma\mu}{\sigma}\Big) $$
So the new kernel and bias are: $F' = \frac{\gamma}{\sigma}F,\quad b' = \beta - \frac{\gamma\mu}{\sigma}$
After that, the bias-equipped conv kernels are converted into 2D FC kernels, which can be added into the main stream's FC3 kernel, so that inference uses only MLP layers.
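A minimal sketch of the BN fusion above (my own illustration; here σ already includes the usual √(var + ε)):

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(kernel, mu, sigma, gamma, beta):
    """F' = gamma*F/sigma, b' = beta - gamma*mu/sigma (BN stats are per output channel)."""
    fused_kernel = kernel * (gamma / sigma).reshape(-1, 1, 1, 1)
    fused_bias = beta - gamma * mu / sigma
    return fused_kernel, fused_bias

O, C, eps = 4, 3, 1e-5
x = torch.randn(2, C, 8, 8)
k = torch.randn(O, C, 3, 3)
mu, var = torch.randn(O), torch.rand(O) + 0.5
gamma, beta = torch.randn(O), torch.randn(O)

y_ref = F.batch_norm(F.conv2d(x, k, padding=1), mu, var, gamma, beta, training=False, eps=eps)
fk, fb = fuse_conv_bn(k, mu, torch.sqrt(var + eps), gamma, beta)
assert torch.allclose(y_ref, F.conv2d(x, fk, bias=fb, padding=1), atol=1e-4)
```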
-
-
RepMLPNet
-
Hierarchical design mimics popular vision models.
-
RepMLPBlock and FFN alternate.
-
Can be used as the backbone for downstream tasks.
-
Adjust the number of parameters in each stage through “set-sharing”.
-
-
No need for huge datasets (JFT-300M) or many epochs (300–400) to train; ImageNet for 100 epochs is enough.
-
Throughput is higher than that of conventional CNN models; speed does not correlate strongly with the number of FLOPs.
RepMLP is suitable for highly parallel devices (GPUs) rather than devices with lower compute capacity, such as mobile chips.
-
The “Identity” branch is necessary for performance, providing information at a different scale and dimensionality.
For “set-sharing”, increasing the number of sets/groups improves accuracy.
-
Locality can be observed on the feature maps.
-
RepMLPNet is robust to the discontinuities between patches split from large images.
The resolution of the Cityscapes dataset doesn't match the pretrained model, so they divided each entire image into small patches.
Module hierarchy: RepMLPNet ⊃ RepMLPNetUnit ⊃ RepMLPBlock.
RepMLPBlock cannot resume training after model.locality_injection(), because its sub-modules have been deleted. Therefore, .locality_injection() should be called on a new (copied) model right before inference.
-
Blog-210426
Source: 结构重参数化:利用参数转换解耦训练和推理结构 - 丁霄汉的文章 - 知乎
Blog-210517
Source: 解读模型压缩6:结构重参数化技术:进可暴力提性能,退可无损做压缩 - 科技猛兽的文章 - 知乎
- Matrix multiplication can be viewed as convolution: multiplying a 2D data matrix by $W^{(F,p)}$ is equivalent to first reshaping the data matrix into a 4D feature map, convolving it, and then reshaping the result back to 2D.
Papers
-
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization