watch: CLIP Paper Walkthrough


Source video: “CLIP 论文逐段精读” (CLIP paper, paragraph-by-paragraph walkthrough), from the 论文精读 (Paper Reading) series by 跟李沐学AI (Learn AI with Mu Li), Bilibili, 2022-02-10

CLIP (Contrastive Language-Image Pre-Training), code: https://github.com/openai/CLIP

Features

  1. Large-scale dataset: 400 million (4e8) image–caption pairs.

  2. Self-supervised learning strategy (pretext task): given an image, find its matching text embedding among the candidates

    • Contrastive learning needs positive and negative samples.

      There is only one correct text for each image, while the remaining texts in the batch serve as negative samples.

  3. Relaxed training objective: pair images with texts rather than predict the caption word by word

  4. Good transferability: Able to generalize to unseen classes based on the text prompts.

    • Leverage text to enhance image features with semantic understanding
\begin{algorithm}
\caption{CLIP}
\begin{algorithmic}
\STATE $I_f$ = ImageEncoder($I$) \quad \COMMENT{$(n,h,w,c) \to (n, d_i)$}
\STATE $T_f$ = TextEncoder($T$) \quad \COMMENT{$(n,l) \to (n, d_t)$}
\STATE $I_e$ = LinearProj($I_f$) \quad \COMMENT{$(n, d_e)$}
\STATE $T_e$ = LinearProj($T_f$) \quad \COMMENT{$(n, d_e)$}
\STATE logits = $I_e \cdot T_e^{\top}$
\STATE labels = np.arange($n$)
\STATE $\mathrm{loss}_i$ = CrossEntropy(logits, labels, axis=0)
\STATE $\mathrm{loss}_t$ = CrossEntropy(logits, labels, axis=1)
\STATE loss = $(\mathrm{loss}_i + \mathrm{loss}_t)/2$
\end{algorithmic}
\end{algorithm}
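
A minimal PyTorch sketch of this objective (the helper name clip_loss, the projection matrices W_i/W_t, and the learnable log-temperature t are illustrative; the L2-normalization and temperature follow the paper's pseudocode, which the listing above omits):

import torch
import torch.nn.functional as F

def clip_loss(I_f, T_f, W_i, W_t, t):
    # project both modalities into the joint embedding space and L2-normalize
    I_e = F.normalize(I_f @ W_i, dim=-1)   # (n, d_e)
    T_e = F.normalize(T_f @ W_t, dim=-1)   # (n, d_e)

    # scaled pairwise cosine similarities; matched pairs sit on the diagonal
    logits = I_e @ T_e.t() * t.exp()       # (n, n)
    labels = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: pick the right text for each image (rows)
    # and the right image for each text (columns)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2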

Experiments

  1. Backbone models: the image encoder can be a ResNet or a ViT; the text encoder is a Transformer.

  2. Zero-shot transfer: No adaptation to the downstream task; the pre-trained model is applied directly to unseen data.

  3. Few-shot transfer: Given a few labeled images, fine-tune or linear-probe the pre-trained model. CLIP outperforms all previous label-supervised pre-trained models.

  4. Full-data transfer: Even with the full downstream training data, CLIP still beats the other pre-trained models.

  5. The features extracted by previous pre-trained models carry only the image modality, whereas CLIP's image features are learned under the guidance of text descriptions, so they are fused with the text modality and steered toward semantic understanding.

  6. Mixed-precision training saves about half the memory without losing performance.

  7. Prompt engineering: Wrap the class label in a sentence using prompt templates so that the input matches the training data, i.e., image–caption pairs where captions are full sentences.

    They made 80 templates describing different situations in images, so the added context narrows the prediction to a smaller, more plausible range (see the prompt-ensembling sketch after this list).

  8. Unrealistic or abstract tasks, such as MNIST digits or counting the number of objects, are difficult for CLIP because they are hard to describe with language. Otherwise, as long as a describable object exists in the image, CLIP can recognize it.
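
To make the prompt-engineering point (item 7) concrete, here is a rough sketch of prompt ensembling with the released package; the three templates and class names below are illustrative stand-ins for the paper's 80 templates:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]
classnames = ["dog", "cat", "airplane"]

with torch.no_grad():
    weights = []
    for name in classnames:
        # fill every template with the class label and encode the sentences
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        # average the per-template embeddings into one classifier weight per class
        emb = emb.mean(dim=0)
        weights.append(emb / emb.norm())
    zeroshot_weights = torch.stack(weights, dim=1)   # (d_e, num_classes)

# later: logits = model.logit_scale.exp() * normalized_image_features @ zeroshot_weights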

Limitations

  1. CLIP is not SOTA on ImageNet; it only stands out in the zero-shot setting.

  2. Cannot understand abstract concepts such as “abnormal” or “safe”.

  3. Out-of-distribution data at zero-shot inference time ruins CLIP’s generalizability: MNIST (very different from natural images) has nothing similar in the training set.

  4. Zero-shot inference with CLIP requires that the “new label” be provided among the candidates, turning classification into a multiple-choice question.

    In contrast, letting the model generate a caption directly from the image would remove this restriction, but that is currently infeasible: it would demand massive computation with today’s less efficient generative training techniques.

  5. Data efficiency is poor: CLIP needs an enormous number of training images, and iterating the dataloader over all of them takes a very long time.

  6. Dataset bias: hyperparameter tuning was based on ImageNet, and the reported performance comes from the 27 chosen evaluation datasets.

  7. The training set is scraped from the internet without filtering, so the model may have learned malicious content.

  8. Counterintuitively, few-shot performance is sometimes worse than the zero-shot setting.

Footer:

  1. The pre-training method isn’t open-source, but the pre-trained model is.

Code

Repo: https://github.com/openai/CLIP

Install CLIP:

conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
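
To verify the install, the package lists the released checkpoints (exact names depend on the installed version):

import clip
print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', ...]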

Zero-shot classification:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # encode each modality separately (reused in the sketch below)
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # the forward pass returns scaled cosine similarities in both directions
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
# prints: [[0.9927937  0.00421068 0.00299572]]
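
The paired logits returned by model(image, text) are just scaled cosine similarities of the two feature sets encoded above. A small sketch to reproduce them, assuming the variables from the snippet above are still in scope and a logit_scale parameter as in the reference implementation:

with torch.no_grad():
    # L2-normalize the separately encoded features
    img = image_features / image_features.norm(dim=-1, keepdim=True)
    txt = text_features / text_features.norm(dim=-1, keepdim=True)
    # multiply the cosine similarities by the learned temperature exp(t)
    manual_logits = model.logit_scale.exp() * img @ txt.t()
    # manual_logits should match logits_per_image (up to numerical precision)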