Sampling
Steps:
- Sample a random noise image from a normal distribution;
- Use the trained network to predict the noise (as opposed to the meaningful object) for one step;
- Use the DDPM algorithm to compute noise-level scaling factors for the given timestep: `s1, s2, s3 = ddpm_scaling(t)`;
- Subtract the predicted noise from the noisy image and add extra noise: `sample = s1 * (sample - s2 * predicted_noise) + s3 * extra_noise`;
- Repeat steps 2 to 4 to remove noise progressively.
```python
import torch

# nn_model and denoise_add_noise are defined elsewhere (see the sketch below)
samples = torch.randn(N_imgs, 3, height, height)
timesteps = 500
for i in range(timesteps, 0, -1):
    # reshape so it broadcasts against the input images (N_imgs, 3, height, height)
    t = torch.tensor([i / timesteps])[:, None, None, None]
    # extra noise (except for the final step)
    z = torch.randn_like(samples) if i > 1 else 0
    # predict noise
    eps = nn_model(samples, t)
    # remove noise and add extra noise
    samples = denoise_add_noise(samples, i, eps, z)
```
Adding extra noise back in before the next denoising step keeps the samples from collapsing toward the average of the training dataset.
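A minimal sketch of the `ddpm_scaling` and `denoise_add_noise` helpers used above, assuming a standard linear DDPM beta schedule; the schedule values (`beta` range) are assumptions, not taken from these notes:

```python
import torch

timesteps = 500
beta = torch.linspace(1e-4, 0.02, timesteps + 1)   # assumed linear noise schedule b_t
alpha = 1.0 - beta                                  # a_t
alpha_bar = torch.cumprod(alpha, dim=0)             # cumulative product of a_t

def ddpm_scaling(t):
    """Noise-level scaling factors for an integer timestep index t."""
    s1 = 1.0 / alpha[t].sqrt()
    s2 = (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt()
    s3 = beta[t].sqrt()
    return s1, s2, s3

def denoise_add_noise(sample, t, predicted_noise, extra_noise):
    """One reverse step: subtract the predicted noise, then add extra noise."""
    s1, s2, s3 = ddpm_scaling(t)
    return s1 * (sample - s2 * predicted_noise) + s3 * extra_noise
```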
UNet
- UNet can output images of the same size as the input, and assigns image features to each pixel.
  - Compress the image into a compact representation;
  - Each down-sampling step halves the spatial resolution and doubles the number of channels.
- UNet also allows incorporating additional information during the decoding stage.
  - At each up-sampling step, the feature map is multiplied by the context embedding and added to the time embedding (see the sketch after this list).
  - The time embedding indicates the timestep of the feature vector, so the time-dependent noise level can be determined.
  - The context embedding can be a text description, which guides the UNet to generate a specific output.
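A minimal sketch of one decoder (up-sampling) step that injects the two embeddings; the layer names and the projection of embeddings to feature channels are assumptions, not the exact architecture from these notes:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One UNet up-sampling step: up-sample, then multiply by the context
    embedding and add the time embedding."""
    def __init__(self, in_channels, emb_dim):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, in_channels // 2, kernel_size=2, stride=2)
        self.ctx_proj = nn.Linear(emb_dim, in_channels // 2)    # context embedding -> channels
        self.time_proj = nn.Linear(emb_dim, in_channels // 2)   # time embedding -> channels

    def forward(self, x, ctx_emb, t_emb):
        h = self.up(x)                                   # (N, C/2, 2H, 2W)
        c = self.ctx_proj(ctx_emb)[:, :, None, None]     # broadcast over spatial dims
        t = self.time_proj(t_emb)[:, :, None, None]
        return h * c + t                                 # multiplied by context, plus time

# usage: 64-channel feature map, 32-dim embeddings
step = DecoderStep(in_channels=64, emb_dim=32)
out = step(torch.randn(2, 64, 8, 8), torch.randn(2, 32), torch.randn(2, 32))   # -> (2, 32, 16, 16)
```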
(2023-07-10)
Training
Train the UNet to identify the noise that was applied to the image.
UNet can segment images (classify each pixel), so is it used here to identify whether each pixel is noise or not?
- No, it is used so that each output pixel carries the extracted or introduced features (here, the per-pixel value of the predicted noise).
Training steps:
- Sample a random timestep (noise-level) to make noise;
- Add the known noise onto a random training image;
- UNet takes the noisy image as input and predicts the applied noise as output;
- Loss is the difference between the true noise and the predicted noise.
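A minimal sketch of one training iteration following these steps, reusing the `alpha_bar` schedule and `timesteps` assumed in the sampling sketch above; `nn_model`, `optimizer`, and `dataloader` are placeholders:

```python
import torch
import torch.nn.functional as F

x0 = next(iter(dataloader))                               # batch of training images (N, 3, H, W)
t = torch.randint(1, timesteps + 1, (x0.shape[0],))       # random timestep (noise level) per image
noise = torch.randn_like(x0)                              # the known noise to apply
ab = alpha_bar[t][:, None, None, None]
x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise          # add the known noise to the image
eps = nn_model(x_t, (t.float() / timesteps)[:, None, None, None])   # predict the applied noise
loss = F.mse_loss(eps, noise)                             # difference between true and predicted noise
optimizer.zero_grad()
loss.backward()
optimizer.step()
```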
(2023-07-11)
Controlling
Use an embedding vector to control the predicted noise.
An embedding is a vector (a set of numbers) that represents something in another space.
Embeddings can perform arithmetic operations, e.g.:
- Paris - France + England ≈ London
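A toy illustration of embedding arithmetic; the vectors below are made-up three-dimensional values chosen only to make the example work, not real learned embeddings:

```python
import torch
import torch.nn.functional as F

# made-up embeddings: dimensions roughly encode "capital-ness", "England-ness", "France-ness"
emb = {
    "Paris":   torch.tensor([0.9, 0.1, 0.8]),
    "France":  torch.tensor([0.1, 0.1, 0.8]),
    "England": torch.tensor([0.1, 0.9, 0.1]),
    "London":  torch.tensor([0.9, 0.9, 0.1]),
}

query = emb["Paris"] - emb["France"] + emb["England"]
# nearest neighbour by cosine similarity
nearest = max(emb, key=lambda w: F.cosine_similarity(query, emb[w], dim=0).item())
print(nearest)   # -> "London"
```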
Noise is what should be removed from the image.
- Once the noise is fully subtracted out, what is left is the generated image.
By injecting context embeddings into the decoder, the predicted noise becomes specific to the given context.
- For example, the noise corresponding to "A ripe avocado" is the pixel content that is not "A ripe avocado", and it will eventually be removed.
Because embedding vectors can be mixed, once the mixed noise is removed, what is left is the combination of the two objects, i.e. the thing that the context embedding stands for.
- For example, an embedding vector for "Avocado armchair" carries the information of both "avocado" and "armchair", so this context leads the model to predict the noise that is neither "avocado" nor "armchair".
Context can also be a one-hot encoded vector indicating a category, which results in a specific class of images (see the sketch below).
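A minimal sketch of one-hot context conditioning during sampling, assuming `nn_model` accepts an extra context argument `c` (the earlier snippets call it without context, so this signature is an assumption):

```python
import torch
import torch.nn.functional as F

n_classes = 5
labels = torch.tensor([0, 3])                                 # desired category per sample
c = F.one_hot(labels, num_classes=n_classes).float()          # one-hot context vectors

# context vectors can also be mixed to combine concepts,
# e.g. half of class 0 and half of class 3
c_mixed = 0.5 * F.one_hot(torch.tensor(0), n_classes).float() \
        + 0.5 * F.one_hot(torch.tensor(3), n_classes).float()

samples = torch.randn(2, 3, 16, 16)
t = torch.tensor([1.0])[:, None, None, None]
eps = nn_model(samples, t, c=c)                               # noise prediction guided by context
```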
Speeding Up
DDIM skips some timesteps, so it breaks the Markov chain process, in which each timestep depends probabilistically on the previous one.
There is a hyper-parameter step_size that decides how many timesteps are skipped.
DDIM performs better than DDPM when fewer than 500 timesteps are used; the quality of DDIM's images may differ from DDPM's.
The Denoising Diffusion Implicit Model predicts a "rough sketch" of the final output and then refines it with the denoising process (see the code below).
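A minimal sketch of DDIM-style sampling with skipped timesteps (fully deterministic, i.e. eta = 0), reusing `nn_model`, `timesteps`, and `alpha_bar` from the sketches above; `denoise_ddim` and the value of `step_size` are illustrative assumptions:

```python
import torch

step_size = 25                                      # how many timesteps to skip per update
samples = torch.randn(N_imgs, 3, height, height)

def denoise_ddim(x, t, t_prev, eps):
    """Deterministic DDIM update: estimate the clean image, then re-noise it
    to the earlier timestep t_prev."""
    ab, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()   # "rough sketch" of the final output
    return ab_prev.sqrt() * x0_pred + (1.0 - ab_prev).sqrt() * eps

for i in range(timesteps, 0, -step_size):
    t = torch.tensor([i / timesteps])[:, None, None, None]
    eps = nn_model(samples, t)                      # predict noise
    samples = denoise_ddim(samples, i, max(i - step_size, 0), eps)
```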