Convolution
- The process of re-weighting the pixels and computing new "pixel values".
- The kernel starts at the top-left corner and slides to the right or down step by step; at each position it takes the inner product with the overlapping patch (multiply corresponding entries, then sum).
- Extracts features.
- With no padding (valid padding), the output size after convolution is $\lfloor\frac{n-k}{s}\rfloor+1$
- Image size: n×n
- Kernel size: k×k
- Stride: s
- The kernel starts at the top-left corner and slides one column to the right at a time until it stops at the right edge. The number of positions to the left of the kernel plus 1 (the current position) gives the output size n-k+1.
For example, if a row has 5 pixels and k=2, there are 3 positions before the last one plus that final one: 3+1=4.

- If the stride s=2 and the kernel cannot slide exactly to the end, the leftover part can either be dropped or padded with extra pixels.
- If the stride s=3, the calculation is: $\frac{n-k}{s}+1 = \frac{5-2}{3}+1 =2$
- If the stride s=4, the calculation is: $\lfloor\frac{n-k}{s}\rfloor+1 = \lfloor\frac{5-2}{4}\rfloor+1 =1$ (these cases are checked in the sketch after this list)
- For same padding, the output size is $\lfloor \frac{n+2\times p-k}{s} \rfloor+1$:
the original image is first padded with p rings of pixels, then convolved.
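A minimal sketch (PyTorch, using the assumed running example of a 5-pixel row with k=2) that evaluates the valid-padding formula for several strides and cross-checks one case against nn.Conv2d:

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, s, p=0):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# Running example: n=5, k=2
for s in (1, 2, 3, 4):
    print(s, conv_out_size(5, 2, s))   # 1 -> 4, 2 -> 2, 3 -> 2, 4 -> 1

# Cross-check the s=3 case with an actual convolution layer
x = torch.randn(1, 1, 5, 5)
conv = nn.Conv2d(1, 1, kernel_size=2, stride=3)
print(conv(x).shape)                   # torch.Size([1, 1, 2, 2])
```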
Padding
- Pad the border of the image with one or more rings of pixels, usually with value 0.
- Keeps the output size the same as the input size.
- Two common kinds of padding:
- valid padding: no padding; only the original image is used.
- same padding: pad the border so the convolution output has the same size as the input.
To keep the output size equal to n, require $\frac{n-k+2p}{s}+1 = n$, which gives $p=\frac{(n-1)\times s+k-n}{2}$; if s=1, then $p=\frac{k-1}{2}$.
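A quick check of the s=1 case, assuming a 3×3 kernel so that p=(k-1)/2=1 keeps a 224×224 input at 224×224:

```python
import torch
import torch.nn as nn

n, k, s = 224, 3, 1
p = (k - 1) // 2                    # p = 1 for a 3x3 kernel with stride 1
conv = nn.Conv2d(3, 16, kernel_size=k, stride=s, padding=p)
x = torch.randn(1, 3, n, n)
print(conv(x).shape)                # torch.Size([1, 16, 224, 224])
```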
Stride
- The step size s by which the kernel slides.
- With stride=1, the kernel slides one column to the right or one row down at a time.
- Compresses information: the output size shrinks proportionally; with stride=2, the output is about half the size of the input (see the example below).
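For instance (assuming a 3×3 kernel with padding 1), stride=2 halves each spatial dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
print(conv(x).shape)   # torch.Size([1, 8, 112, 112]) -- half of 224x224
```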
Pooling
- Keeps the salient features while reducing computation.
- max-pooling: like a nearsighted eye, it only sees the largest value in each window;
- average-pooling: takes the mean of each window.
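A tiny sketch of both poolings on a 4×4 single-channel input; the channel count is unchanged while H and W shrink by the window size:

```python
import torch
import torch.nn.functional as F

x = torch.arange(16.0).reshape(1, 1, 4, 4)   # values 0..15 as a 4x4 image
print(F.max_pool2d(x, kernel_size=2))        # -> [[[[ 5.,  7.], [13., 15.]]]]
print(F.avg_pool2d(x, kernel_size=2))        # -> [[[[ 2.5,  4.5], [10.5, 12.5]]]]
```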
(2023-12-12)
F.avg_pool3d
Number of channels doesn’t change, and D, H, W shrink. Docs
MVSNet uses AvgPool3d to compute the sum of every 4 depth-probability planes:
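A hedged sketch of that idea (not the exact MVSNet code; the shapes are assumed): avg_pool3d with a (4, 1, 1) window averages every 4 consecutive depth planes, so multiplying by 4 gives their sum:

```python
import torch
import torch.nn.functional as F

# A (B, 1, D, H, W) probability volume with hypothetical sizes
prob_volume = torch.softmax(torch.randn(1, 1, 32, 8, 8), dim=2)

# Average over each window of 4 consecutive depth planes, then scale to a sum
sum4 = 4 * F.avg_pool3d(prob_volume, kernel_size=(4, 1, 1), stride=1, padding=0)
print(sum4.shape)   # torch.Size([1, 1, 29, 8, 8]); channels unchanged, D shrinks
```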
Deconvolution
Complexity of CNN
ConvTranspose2d()
Deconvolution visualization
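A minimal sketch (assumed layer parameters) of ConvTranspose2d doubling the spatial resolution, the usual deconvolution/upsampling setup:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)   # torch.Size([1, 32, 112, 112]) -- 2x upsampling
```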
(2023-07-19)
torchvision.models.resnet34
ResNet - PyTorch | Source code
layers=[3, 4, 6, 3] means that layer1 has 3 BasicBlock convolution blocks (ResNet-50 uses Bottleneck blocks instead),
layer2 has 4 blocks, layer3 has 6 blocks, and layer4 has 3 blocks.
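A quick check of that layout (assuming a torchvision version that accepts the weights argument):

```python
import torchvision

model = torchvision.models.resnet34(weights=None)
# Number of BasicBlocks in layer1..layer4
print([len(layer) for layer in (model.layer1, model.layer2, model.layer3, model.layer4)])
# [3, 4, 6, 3]
```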
(2023-09-12)
F.pad
Pads an image along the width, height, or depth directions. Docs
- The padding amounts are given starting from the last dimension and working backwards, e.g., padding the last 3 dimensions:
F.pad(x, (padding_left, padding_right, padding_top, padding_bottom, padding_front, padding_back)). So the order (left, right, top, bottom, front, back) is the reverse of the tensor's dimension order (Depth, Height, Width).
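A small sketch of that ordering on a 5-D (B, C, D, H, W) tensor; the first pair pads W, the second H, the third D:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4, 224, 224)     # (B, C, D, H, W)
y = F.pad(x, (1, 1, 2, 2, 0, 3))       # (left, right, top, bottom, front, back)
print(y.shape)                         # torch.Size([2, 3, 7, 228, 226])
```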
nn.Conv3d
Input: (B, Ch_in, D, H, W); Output: (B, Ch_out, D_out, H_out, W_out)
- For example, a tensor with shape (2, 3, 4, 224, 224) is 2 video clips, each consisting of 4 frames, where every frame is a 3-channel 224×224 image.
After convolution with a kernel of size (2, 4, 4) and a stride of (2, 4, 4), it is transformed to (B=2, Ch_out=128, D=2, H=56, W=56).
Each input channel's stack of D frames is convolved with its own slice of a 3D kernel; once every channel has been filtered, the weighted channels are summed to form one output channel.
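A sketch of that example; the stride equal to the kernel size is an assumption needed to reach the stated output shape:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 4, 224, 224)   # 2 clips, 3 channels, 4 frames, 224x224
conv = nn.Conv3d(in_channels=3, out_channels=128,
                 kernel_size=(2, 4, 4), stride=(2, 4, 4))
print(conv(x).shape)                 # torch.Size([2, 128, 2, 56, 56])
```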
Depthwise Convolution
(2023-07-25)
Separate a convolution into two steps:
- Filter each feature-map plane with its own single-channel kernel (Depthwise Conv);
- Expand the number of channels using 1×1 kernels (Pointwise Conv).
- FLOPs are reduced, but memory/IO access increases, which can make inference slower. Depth-wise Convolution - 沈景兵的文章 - 知乎
The channel-expansion step costs the same number of FLOPs whether it is done by a normal convolution or a pointwise convolution: for example, when expanding 3 channels to 256 channels, every input pixel is multiplied 256 times (once per output channel).
However, the depthwise convolution does not multiply a kernel with every channel and then sum the results; each kernel is applied to exactly one channel. A Basic Introduction to Separable Convolutions - Medium
- Fewer parameters: 3×3×256 kernels are replaced with 1×1×256 kernels for every pixel of the resulting feature map.
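A sketch of the two-step depthwise separable convolution in PyTorch (groups=in_channels for the depthwise step, a 1×1 conv for the pointwise step), with a parameter-count comparison against a standard convolution; the 3-to-256-channel sizes follow the example above:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 3, 256

# Step 1: depthwise conv - one 3x3 kernel per input channel (groups=in_ch)
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False)
# Step 2: pointwise conv - 1x1 kernels to expand to out_ch channels
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

n_sep = sum(p.numel() for p in depthwise.parameters()) + \
        sum(p.numel() for p in pointwise.parameters())
n_std = sum(p.numel() for p in standard.parameters())
print(n_sep, n_std)                    # 795 vs 6912 parameters

x = torch.randn(1, in_ch, 32, 32)
print(pointwise(depthwise(x)).shape)   # torch.Size([1, 256, 32, 32])
```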
(2024-06-22)
- A video explanation: Nik_Li - 15分钟看懂depthwise convolution
- The first-stage depthwise convolution stays fixed; the number of output channels is determined by how many times the second-stage pointwise convolution is applied.
Compared with a traditional convolution, depthwise separable convolution is like one Encoder (the fixed first stage) paired with many Decoders (one per output channel); in a traditional convolution layer, each output channel draws on a different "input" (the first stage is not fixed), so every convolution is its own encoder-decoder pair.
- If there is only one output channel, the two kinds of convolution are equivalent, since their expressions can be written identically (see the check below). With multiple output channels, the Encoder must be expressive enough that different "decoders" can interpret it into different outputs, each carrying different information.
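A numeric check of the single-output-channel equivalence: folding the pointwise weights into the depthwise kernels reproduces a standard convolution exactly (the kernel values here are random, purely for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
dw_kernel = torch.randn(3, 1, 3, 3)    # one 3x3 kernel per input channel
pw_weight = torch.randn(1, 3, 1, 1)    # pointwise weights for 1 output channel

# Depthwise (groups=3) followed by pointwise (1x1)
y_sep = F.conv2d(F.conv2d(x, dw_kernel, groups=3), pw_weight)

# Equivalent standard kernel: scale each channel's kernel by its pointwise weight
std_kernel = (pw_weight.view(3, 1, 1, 1) * dw_kernel).view(1, 3, 3, 3)
y_std = F.conv2d(x, std_kernel)

print(torch.allclose(y_sep, y_std, atol=1e-5))   # True
```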