Code | Arxiv | ProjPage | OpenReview | Ytb
Note
Abs
- Transfer between different modalities: from image classification to video understanding.
- Add 3 new layers inside each transformer block, not at the very end of the model.
- The 3 new layers:
  - Spatial adapter: after the (spatial) self-attention;
  - Temporal adapter: after the reused self-attention that attends along time, which runs before the spatial one;
  - Joint adapter: a bypass branch of the MLP layer.
- Freeze the pre-trained parameters and optimize only the new layers to transfer the pre-trained model to another task (see the sketch below).
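A minimal sketch of that freezing step, assuming the newly added layers can be recognized by having "adapter" in their parameter names (the actual naming in the released code may differ):

```python
import torch.nn as nn

def freeze_all_but_adapters(model: nn.Module) -> None:
    """Freeze pre-trained weights; keep only the newly inserted adapter layers trainable."""
    for name, param in model.named_parameters():
        # Assumption: adapter parameters contain "adapter" in their names.
        param.requires_grad = 'adapter' in name.lower()

# Usage (hypothetical model): freeze_all_but_adapters(vit_with_adapters)
```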
Intro
- Two existing directions, adding a temporal module on top of an image transformer or inflating it into a video model, share the same drawback: both require heavy-computation full fine-tuning.
Related work
- Pre-trained image models have good transferability.
- Fully fine-tuning a transformer-based image model is uneconomical.
- Parameter-efficient fine-tuning has been applied to LLMs for downstream tasks.
Method
- ViT consists of 12 encoder blocks (MSA and MLP).
- An image is split into N patches, each of which is projected to D channels.
- The input to the MSA is the patch tokens with a class token prepended and positional encodings added.
- Space-only model (baseline, no temporal modeling): apply the pre-trained, frozen ViT to a video by processing each frame independently.
- Each frame is represented by its final class token.
- The class tokens of all frames are averaged to form a single vector for prediction (see the sketch below).
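A minimal sketch of this baseline, assuming the ViT returns the final class token per frame (`vit` and `head` are placeholder names, not from the codebase):

```python
import torch
import torch.nn as nn

def space_only_logits(vit: nn.Module, head: nn.Module, video: torch.Tensor) -> torch.Tensor:
    """Space-only baseline: run a frozen image ViT per frame, average the class tokens."""
    B, T = video.shape[:2]                               # video: (B, T, 3, H, W)
    cls_tokens = vit(video.flatten(0, 1))                # (B*T, D), final class token of each frame
    video_feat = cls_tokens.view(B, T, -1).mean(dim=1)   # average over the T frames
    return head(video_feat)                              # (B, num_classes)
```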
- Spatial adaptation: add an adapter after the self-attention (the pre-trained, frozen MSA), which fuses the N+1 tokens of each frame.
- An adapter is a bottleneck, i.e., Reduce-Act-Expand, with a skip connection (see the sketch below).
- This alone achieves performance comparable to the space-only baseline, because the image model already learns spatial features well.
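A PyTorch sketch of such a bottleneck adapter; the hidden ratio and the GELU activation are assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project (Reduce), activate (Act), up-project (Expand)."""
    def __init__(self, dim: int, ratio: float = 0.25, skip: bool = True):
        super().__init__()
        hidden = int(dim * ratio)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        self.skip = skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.up(self.act(self.down(x)))
        return x + out if self.skip else out
```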
- Temporal modeling reuses the self-attention parameters: the T frames are fused by reshaping the tensor so that attention runs along the temporal dimension.
- Another adapter is appended to adapt the generated temporal features.
- Temporal modeling is performed ahead of spatial modeling, so this adapter has no skip connection and is zero-initialized to avoid disrupting the performance of the original model.
- By reusing the MSA, the number of parameters is maintained (see the sketch below).
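A sketch of this reuse, building on the `Adapter` class above and on the rearrange pattern quoted later in these notes; it assumes the MSA module takes a single token sequence (as in CLIP/timm ViT blocks), omits layer norms, and the zero-initialization detail is my reading of the description:

```python
import torch
import torch.nn as nn
from einops import rearrange

class TemporalAdaptation(nn.Module):
    """Apply a shared, frozen MSA along the temporal axis, then adapt its output."""
    def __init__(self, frozen_msa: nn.Module, dim: int, num_frames: int):
        super().__init__()
        self.msa = frozen_msa                    # the same frozen spatial MSA, reused
        self.num_frames = num_frames
        self.adapter = Adapter(dim, skip=False)  # no skip connection
        nn.init.zeros_(self.adapter.up.weight)   # zero-init so the branch starts as identity
        nn.init.zeros_(self.adapter.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B*T, N+1, D) -> attend over the T frames at each token position
        n = x.shape[1]
        xt = rearrange(x, '(b t) n d -> (b n) t d', t=self.num_frames)
        xt = self.adapter(self.msa(xt))
        xt = rearrange(xt, '(b n) t d -> (b t) n d', n=n)
        return x + xt                            # residual around the whole temporal branch
```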
- Joint adaptation jointly fits the temporal and spatial features: an adapter runs in parallel with the MLP as a bypass branch.
- This adapter also has no skip connection.
- Finally, the class tokens of all frames are averaged and passed to the classification head (see the block sketch below).
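Putting the three adaptations together, a rough sketch of one adapted block, reusing the `Adapter` and `TemporalAdaptation` classes above; layer norms and scaling factors are omitted, so this is my reading of the description rather than the released implementation:

```python
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    """One frozen ViT block with temporal, spatial and joint adapters (sketch)."""
    def __init__(self, frozen_msa: nn.Module, frozen_mlp: nn.Module, dim: int, num_frames: int):
        super().__init__()
        self.temporal = TemporalAdaptation(frozen_msa, dim, num_frames)
        self.msa = frozen_msa                       # spatial attention, frozen
        self.mlp = frozen_mlp                       # frozen MLP
        self.s_adapter = Adapter(dim, skip=True)    # spatial adapter
        self.j_adapter = Adapter(dim, skip=False)   # joint adapter: bypass branch of the MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.temporal(x)                        # temporal adaptation (reused MSA + adapter)
        x = x + self.s_adapter(self.msa(x))         # spatial adaptation
        x = x + self.mlp(x) + self.j_adapter(x)     # joint adaptation: MLP plus parallel adapter
        return x
```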
Experiments
Task: video action classification
- 8 frames
- Memory: AIM based on Swin-B pre-trained on IN-21K occupies 9 GB.
- Underperforms on temporally-heavy videos because the temporal modeling simply reuses the spatial modeling parameters.
Discussion
- Deeper layers need adaptation for task-specific features, while shallow layers may not.
Conclusion
- Transfer models trained on other sequence data, such as text and audio, to video action recognition.
Data flow of the adapted model (mermaid):

```mermaid
flowchart TD
    input("Input (224,224,3)") --> cls("Class token (1,768)") & pe("Position Embedding (197,768)")
    input --> feat("Conv2d (16x16,s16) (14,14)")
    cls & feat --> Cat
    pe & Cat --> add1("Add")
    add1 --> msa1("MSA") --> Tadap --> msa2("MSA") --> Sadap
    Sadap --> ineck("Inverse bottleneck")
    Sadap --> Jadap
    add1 & ineck & Jadap --> add2("Add") --> x
```
Play
Debug code with experiment settings in “run_exp.sh”
Environment
Dataset Diving48
- To prepare the Diving48 dataset, I installed MMAction2 following its documentation:

  ```bash
  conda create --name openmmlab python=3.8 -y
  conda activate openmmlab
  conda install pytorch torchvision -c pytorch

  # Step 1:
  pip install -U openmim
  mim install mmengine
  mim install mmcv
  mim install mmdet
  mim install mmpose

  # Step 2:
  git clone https://github.com/open-mmlab/mmaction2.git
  cd mmaction2
  pip install -v -e .
  ```
- Following the “Download from Official Source” section:
- Download annotations using their shell script: `bash download_annotations.sh`
- Download the videos “Diving48_rgb.tar.gz” (9.6G).
- Only extract the RGB frames: `bash extract_rgb_frames_opencv.sh`
- Generate the file list: `bash generate_videos_filelist.sh`
- Make a symbolic link to “mmaction2/data” in “adapt-image-models”:

  ```bash
  ln -s /home/zichen/Downloads/mmaction2/data/ ./
  ```
Format
- The annotation file “data/diving48/diving48_train_list_videos.txt” contains the filename and class label of each video.
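A quick sanity check of that list, assuming the usual mmaction2 “&lt;filename&gt; &lt;label&gt;” space-separated format:

```python
# Count classes and videos in the Diving48 training list (assumes "<filename> <label>" per line).
from collections import Counter

counts = Counter()
with open('data/diving48/diving48_train_list_videos.txt') as f:
    for line in f:
        filename, label = line.split()
        counts[int(label)] += 1
print(f'{len(counts)} classes, {sum(counts.values())} videos')
```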
Config for 1080Ti
- Training with 1 video per GPU cannot make the accuracy increase.
- The default config (8 videos, 32 frames) causes OOM on one 1080Ti (“configs/recognition/vit/vitclip_large_diving48.py”).
- Override the number of videos per GPU in the config with args:

  ```json
  "args": [
      "--cfg-options", "data.videos_per_gpu=1"
  ]
  ```

  But the `top1_acc` didn't grow:

  ```
  2023-08-31 12:22:11,768 - mmaction - INFO - Epoch [1][4180/15027] lr: 6.003e-05, eta: 6 days, 8:30:03, time: 0.709, data_time: 0.001, memory: 5659, top1_acc: 0.0500, top5_acc: 0.3500, loss_cls: 3.4383, loss: 3.4383
  ```

- `"data.videos_per_gpu=2"` will OOM.
- Reduce `num_frames`. The “.vscode/launch.json” is made based on “run_exp.sh”:

  ```json
  "args": [
      "--cfg-options",
      "model.backbone.pretrained=openaiclip",
      "work_dir=work_dirs_vit/diving48/debug",
      "data.videos_per_gpu=8",
      "model.backbone.num_frames=3",
      // The following cannot change
      // "train_pipeline[1].clip_len=3",
      // "val_pipeline[1].clip_len=3"
      "--train_clip_len", "{\"1\": {\"clip_len\": 3}}"
  ]
  ```

  (2023-09-06) The `cfg.data.train['pipeline']['clip_len']` didn't change; it still equals 32. Consequently, the images `x` passed to `forward(self, x)` of the model `ViT_CLIP` have the shape (256, 197, 768). However, the instance variable `self.num_frames` of the backbone model `ViT_CLIP` was changed to 3. Then `einops.rearrange` cannot parse the dimensionality in:

  ```python
  x = rearrange(x, '(b t) n d -> (b n) t d', t=self.num_frames)
  ```

  ```
  einops.EinopsError: Shape mismatch, can't divide axis of length 256 in chunks of 3
  ```

  The dataset is built based on the key `cfg.data.train`, so its values are also required to be updated:

  ```python
  cfg.merge_from_dict(dict(train_pipeline=args.train_clip_len, val_pipeline=args.train_clip_len))
  update_option = {'data': {'train': {'pipeline': args.train_clip_len}, 'val': {'pipeline': args.train_clip_len}}}
  cfg.merge_from_dict(update_option)
  ```

  Start training:

  ```bash
  export CUDA_VISIBLE_DEVICES=4
  python -m torch.distributed.launch \
      --nproc_per_node=1 --master_port=29500 \
      tools/train.py \
      "configs/recognition/vit/vitclip_base_diving48.py" \
      --launcher="pytorch" \
      --test-last \
      --validate \
      --cfg-options model.backbone.pretrained="openaiclip" \
      work_dir="work_dirs_vit/diving48/debug" \
      data.videos_per_gpu=8 \
      model.backbone.num_frames=3 \
      --train_clip_len "{\"1\": {\"clip_len\": 3}}"
  ```
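A tiny reproduction of why that rearrange fails, using the numbers observed above (8 videos per GPU, `clip_len` still 32, `num_frames` overridden to 3):

```python
import torch
from einops import rearrange, EinopsError

videos_per_gpu, clip_len, num_frames = 8, 32, 3        # clip_len stayed 32; num_frames became 3
x = torch.randn(videos_per_gpu * clip_len, 197, 768)   # (256, 197, 768), as observed in the debugger
try:
    rearrange(x, '(b t) n d -> (b n) t d', t=num_frames)
except EinopsError as err:
    print(err)  # 256 = 8 * 32 is not divisible by 3, hence the shape mismatch
```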
Optimization
- AdamW: lr=3e-4, weight_decay=0.05
- LR scheduler: CosineAnnealing
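In mmaction2-style config terms, those two settings would look roughly like this (only these fields are sketched; `min_lr` is an assumption, not read from the repo's configs):

```python
# Sketch of the optimizer / LR schedule fields of an mmaction2 (0.x) config.
optimizer = dict(type='AdamW', lr=3e-4, weight_decay=0.05)
lr_config = dict(policy='CosineAnnealing', min_lr=0)  # min_lr is an assumption
```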
Pseudocode
With backbone: ViT_CLIP
\begin{algorithm}
\caption{main()}
\begin{algorithmic}
\PROCEDURE{Config}{cfg, args}
\STATE args = parse\_args()
\PROCEDURE{Config.fromfile}{args.config}
\STATE model settings
\STATE dataset settings: ann\_file, train\_pipeline,...
\STATE optimizer settings
\STATE learning policy
\STATE runtime settings
\ENDPROCEDURE
\ENDPROCEDURE
\STATE $\newline$
\PROCEDURE{build-model}{cfg.model}
\COMMENT{Construct ViT with Adapters added}
\PROCEDURE{build-localizer}{cfg}
\PROCEDURE{LOCALIZERS.build}{cfg}
\PROCEDURE{BaseRecognizer}{}
\STATE $\newline$
\PROCEDURE {builder.build-backbone}{backbone}
\STATE BACKBONES.build(cfg)
\ENDPROCEDURE
\STATE $\newline$
\PROCEDURE {init-weights}{}
\STATE self.backbone.init\_weights()
\COMMENT{Load pretrained state\_dict}
\ENDPROCEDURE
\STATE $\newline$
\ENDPROCEDURE
\ENDPROCEDURE
\ENDPROCEDURE
\ENDPROCEDURE
\STATE $\newline$
\STATE datasets = [build\_dataset(cfg.data.train)]
\STATE $\qquad$ build\_from\_cfg(cfg, DATASETS)
\STATE $\qquad$ 11 transforms operations
\STATE Freeze params.requires\_grad=False
\STATE $\newline$
\PROCEDURE{train-model}{model,datasets,cfg,...}
\STATE dataloader\_settings
\STATE data\_loaders = build\_dataloader(dataset, dataloader\_setting)
\STATE optimizer = build\_optimizer(model, cfg.optimizer)
\STATE amp settings
\STATE fp16 settings
\STATE register DistOptimizerHook
\STATE build validation dataset and dataloader
\STATE $\newline$
\PROCEDURE{runner.run}{data\_loaders, cfg.workflow, cfg.total\_epochs,**runner\_kwargs}
\STATE DistOptimizerHook.before\_run(self, runner):
\STATE $\qquad$ runner.optimizer.zero\_grad()
\STATE BaseRecognizer.train\_step(self, data\_batch,)
\STATE losses = self(imgs, label)
\PROCEDURE {Recognizer3D.forward-train}{img, label}
\STATE x = BaseRecognizer.extract\_feat(imgs)
\STATE $\qquad$ self.backbone(imgs)
\COMMENT{ViT\_CLIP.forward()}
\ENDPROCEDURE
\STATE $\qquad$ self.forward\_test(img, label)
\ENDPROCEDURE
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
Debug VideoSwin
- The pretrained weights of `ViT_CLIP` are obtained from an initialized CLIP model:

  ```python
  clip_model, preprocess = clip.load("ViT-B/16", device="cpu")
  pretrain_dict = clip_model.visual.state_dict()  # params of the visual encoder
  del clip_model
  del pretrain_dict['proj']
  msg = self.load_state_dict(pretrain_dict, strict=False)
  ```
- However, the weights of the Swin Transformer need to be loaded from a file. Source code
- Reminded by this issue: MMCV load pretrained swin transformer.
- The pretrained Swin Transformer (Swin-B 224x224, “swin-base_3rdparty_in21k.pth”) from open-mmlab (mmpretrain) doesn’t have the key ‘model’, so it mismatches the code:

  ```python
  def inflate_weights(self, logger):
      checkpoint = torch.load(self.pretrained, map_location='cpu')
      state_dict = checkpoint['model']
  ```

- The pretrained Swin from Microsoft, in contrast, can be loaded successfully.
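A small workaround sketch (my own, not part of the AIM code): fall back across the common checkpoint layouts before reading the weights, so both the open-mmlab and the Microsoft checkpoints would load.

```python
import torch

def load_swin_state_dict(path: str) -> dict:
    """Return the weight dict whether the checkpoint stores it under 'model',
    'state_dict', or as a bare state_dict (assumption about common layouts)."""
    checkpoint = torch.load(path, map_location='cpu')
    for key in ('model', 'state_dict'):
        if isinstance(checkpoint, dict) and key in checkpoint:
            return checkpoint[key]
    return checkpoint
```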
- The Swin Transformer has not been trained with CLIP, only on ImageNet-21K.
- The author adds adapters to the “Swin-B_IN-21K” `SwinTransformer2D` (“swin2d.py”) in “mmaction/models/backbones/swin2d_adapter.py”, as clarified in issue 18.
- The “swin2d_adapter” is compared with `SwinTransformer3D` (VideoSwin, “swin_transformer.py”) in Table 6. Most of their experiments are based on `ViT_CLIP` and compared with TimeSformer.
- `SwinTransformer2D` is adapted via the config “configs/recognition/swin/swin2d_adapter_patch244_window7_kinetics400_1k.py”, whereas the config file “configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py” is for the original VideoSwin `SwinTransformer3D`.
- The arguments `pretrained: str` and `pretrained2d: bool` of the class `SwinTransformer3D` originate in VideoSwin, which adapted a pretrained 2D Swin Transformer to 3D. The AIM code is based on VideoSwin.
- Following VideoSwin, `pretrained` is supposed to be a path to the pretrained model, which should be downloaded in advance. An example is the issue “KeyError: ‘patch_embed.proj.weight’ #22”.
Based on the above, the args in “launch.json” should set `model.backbone.pretrained` to the path of the locally downloaded checkpoint.
Dataset SSv2
AIM-Swin only has configuration files for the K400 and SSv2 datasets. K400 has 240K training videos, which is massive, so I chose the smaller one, SSv2, which has 169K training videos.
Refer to the guide of SSv2 - mmaction2
- Annotations: once signed in to your Qualcomm account, download “Labels” into “data/sthv2/annotations/” from the homepage (you need to acknowledge the agreement before jumping to the download page).

  ```bash
  unzip 20bn-something-something-download-package-labels.zip
  # Rename to match the python code "parse_file_list.py"
  mv data/sthv2/annotations/labels/train.json data/sthv2/annotations/something-something-v2-train.json
  mv data/sthv2/annotations/labels/validation.json data/sthv2/annotations/something-something-v2-validation.json
  mv data/sthv2/annotations/labels/test.json data/sthv2/annotations/something-something-v2-test.json
  mv data/sthv2/annotations/labels/labels.json data/sthv2/annotations/something-something-v2-labels.json
  ```
- Videos: download the 20 files into “mmaction2/data/sthv2/”. By executing the following commands, 220847 webm videos (19G) are extracted into the folder “sthv2/20bn-something-something-v2”:

  ```bash
  unzip 20bn-something-something-v2-\??.zip
  cat 20bn-something-something-v2-?? | tar zx
  # Rename to match the script below and configs in AIM
  mv 20bn-something-something-v2/ videos/
  ```
- Split: generate the file lists.

  ```bash
  cd mmaction2/tools/data/sthv2/
  bash generate_videos_filelist.sh
  ```

  Two .txt files, “sthv2_train_list_videos.txt” and “sthv2_val_list_videos.txt”, are created under “data/sthv2/”.
To debug AIM-swin with SSv2, specify the config file as “configs/recognition/swin/swin2d_adapter_patch244_window7_sthv2_1k.py” in “launch.json”.
```
2023-09-12 15:41:48,166 - mmaction - INFO - Epoch [1][28160/84457] lr: 6.601e-05, eta: 21 days, 10:31:39, time: 0.365, data_time: 0.001, memory: 1420, loss_cls: 4.3897, loss: 4.3897
```
Forward swin
\begin{algorithm}
\caption{SwinTransformer2d\_Adapter}
\begin{algorithmic}
\PROCEDURE{forward}{x: (B,T,D,H,W)}
\STATE Conv3d extracts feat maps: (B, C, num\_Ttokens, H', W')
\STATE $\newline$
\PROCEDURE{SwinTransformer2d-Adapter}{B*num\_Ttokens, H*W, C}
\STATE 2 SwinTransformerBlock
\STATE $\quad$ rearrange
\STATE $\quad$ LN1
\STATE $\quad$ Temporal MSA mix "num\_Ttokens" of feat maps
\COMMENT{even blks}
\STATE $\quad$ Temporal Adapter
\STATE $\quad$ rearrange back
\STATE $\newline$
\STATE $\quad$ LN1
\STATE $\quad$ Shift window rows and cols
\STATE $\quad$ window\_partition
\COMMENT{reshape}
\STATE $\quad$ WindowAttention mix "pixels" in each window
\STATE $\quad$ Spatial Adapter
\STATE $\quad$ window\_reverse
\STATE $\quad$ Shift window rows and cols
\STATE $\newline$
\STATE $\quad$ Squash feat maps to 1D
\STATE $\quad$ Skip connect with the features before S\_adap
\STATE $\quad$ LN2
\STATE $\quad$ MLP + Joint Adapter
\STATE PatchMerging: (B*num\_Ttokens, H'/2*W'/2, 2*C)
\STATE $\newline$
\STATE 2 SwinTransformerBlock
\STATE PatchMerging: (B*num\_Ttokens, H'/4*W'/4, 4*C)
\STATE $\newline$
\STATE 18 SwinTransformerBlock
\STATE PatchMerging: (B*num\_Ttokens, H'/8*W'/8, 8*C)
\STATE $\newline$
\STATE 2 SwinTransformerBlock
\ENDPROCEDURE
\STATE $\newline$
\STATE LN
\STATE rearrange to (B,C,T,H,W)
\STATE cls\_head, i.e. I3DHead (A linear layer)
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
- The reason for setting `window_size` to 7 may be that the resolution of the feature maps is (56,56), which can shrink gradually to (7,7) (see the sketch below).
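A quick check of that arithmetic, assuming a 224x224 input and patch size 4 as in Swin-B:

```python
# Each PatchMerging halves the spatial resolution, ending exactly at the 7x7 window size.
res = 224 // 4                            # 56: resolution after patch embedding (patch size 4)
for stage in range(4):
    print(f'stage {stage}: {res}x{res}')  # 56, 28, 14, 7
    if stage < 3:
        res //= 2                         # PatchMerging between stages
```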
- Adapter: pass the attended features through a bottleneck (2-layer MLP) to adapt them.
Adapted Swin
Differences of the adapted Swin (“swin2d_adapter.py”) from the baseline model SwinTransformer2D (“swin_transformer.py”):
- “swin2d” has one temporal adapter more than “swin_transformer”.
- “swin2d_adapter” has no joint adapter.