memo: PyTorch | Tricks for Faster Training

Tips 1-17 come from Faster Deep Learning Training with PyTorch – a 2021 Guide by Lorenz Kuhn.

1. Consider using another learning rate schedule

  • torch.optim.lr_scheduler.CyclicLR
  • torch.optim.lr_scheduler.OneCycleLR (see the sketch below)
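
A minimal sketch of wiring up OneCycleLR; the toy model, data, and hyperparameter values are illustrative assumptions, not from the original guide:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy model and data purely for illustration.
    model = nn.Linear(20, 2)
    train_loader = DataLoader(
        TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,))),
        batch_size=64,
    )
    EPOCHS = 3

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.1,                         # peak learning rate of the cycle
        steps_per_epoch=len(train_loader),
        epochs=EPOCHS,
    )

    for epoch in range(EPOCHS):
        for inputs, targets in train_loader:
            loss = nn.functional.cross_entropy(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                # OneCycleLR is stepped per batch, not per epoch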

2. Use multiple workers and pinned memory in DataLoader

torch.utils.data.DataLoader(train_dataset, batch_size=64, num_workers=4, pin_memory=True) (DataLoader docs)

num_workers

Rule of thumb: set num_workers to four times the number of available GPUs. Note that increasing num_workers also increases RAM consumption.

pin_memory

When using a GPU, it’s better to set pin_memory=True; this instructs the DataLoader to use pinned (page-locked) memory, which enables faster, asynchronous memory copies from the host to the GPU. Tutorial-Szymon Migacz

pin_memory avoids one implicit CPU-to-CPU copy (from pageable memory to pinned memory) when performing an a.cuda() operation, as the illustration in the NVIDIA blog shows: a “point-to-point copy” (I forget where I first picked up that phrasing).

With pinned-memory tensors, the copy a.cuda(non_blocking=True) is asynchronous with respect to the host (CPU). If the code is structured as:

  1. a.cuda(non_blocking=True) # copy from CPU to GPU
  2. Perform some CPU operations
  3. Perform GPU operations using a.

then steps 1 and 2 can proceed in parallel, so the maximum time that can be saved is the duration of step 2.
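
A minimal sketch of this overlap pattern; the tensor size and the CPU-side work are illustrative assumptions:

    import torch

    # Allocate a tensor in pinned (page-locked) host memory.
    a = torch.randn(10_000, 1_000).pin_memory()

    if torch.cuda.is_available():
        # Step 1: asynchronous host-to-device copy; returns immediately.
        a_gpu = a.cuda(non_blocking=True)

        # Step 2: unrelated CPU work overlaps with the copy above.
        cpu_result = sum(i * i for i in range(1_000_000))

        # Step 3: GPU work that consumes a_gpu; CUDA orders it after the copy.
        out = a_gpu.sum()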

3. Max out the batch size

  • Other hyperparameters, like the learning rate, have to be adjusted. Rule of thumb: double the learning rate when you double the batch size (see the sketch below).
  • May cause worse generalization performance.
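
A minimal sketch of that linear-scaling rule of thumb; the baseline values are illustrative assumptions:

    # Linear scaling rule of thumb: scale the learning rate with the batch size.
    base_batch_size, base_lr = 64, 0.1                       # illustrative baseline
    new_batch_size = 256
    new_lr = base_lr * (new_batch_size / base_batch_size)    # -> 0.4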

4. Use Automatic Mixed Precision (AMP)

  • Some operations run in half precision (FP16) rather than single precision (FP32), which is faster (see the sketch below).
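
A minimal AMP training-step sketch; the toy model, data, and optimizer are illustrative assumptions (requires a CUDA device):

    import torch
    from torch import nn

    model = nn.Linear(20, 2).cuda()                  # toy model for illustration
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()             # scales the loss to avoid FP16 underflow

    inputs = torch.randn(64, 20, device="cuda")
    targets = torch.randint(0, 2, (64,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # ops run in FP16/FP32 as appropriate
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                    # backward pass on the scaled loss
    scaler.step(optimizer)                           # unscales gradients, then optimizer.step()
    scaler.update()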

5. Consider using another optimizer

  • AdamW often outperforms Adam thanks to decoupled weight decay (rather than L2 regularization); see the sketch below.
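
A minimal sketch of switching to AdamW; the toy model and hyperparameter values are illustrative assumptions:

    import torch
    from torch import nn

    model = nn.Linear(20, 2)  # toy model for illustration

    # AdamW applies weight decay directly to the weights instead of adding
    # an L2 penalty to the gradients (decoupled weight decay).
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)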

6. Turn on cuDNN benchmarking

Tutorial-Szymon Migacz
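
A one-line sketch; this mainly pays off when input shapes stay fixed across iterations, since cuDNN benchmarks and caches the fastest algorithm per shape:

    import torch

    # Let cuDNN benchmark candidate convolution algorithms and cache the fastest one.
    # Varying input sizes trigger re-benchmarking and can slow things down instead.
    torch.backends.cudnn.benchmark = True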

7. Avoid unnecessary CPU-GPU synchronizations

(Tutorial-Szymon Migacz):

  • tensor.cpu(), tensor.cuda(), tensor.to(device)
  • tensor.item() or tensor.numpy()
  • print(cuda_tensor)
  • cuda_tensor.nonzero(), which retrieves the indices of all non-zero elements
  • Python control flow that depends on CUDA tensors, e.g., if (cuda_tensor != 0).all()

Good practice is to let the CPU run ahead of the accelerator as much as possible, so that the accelerator’s work queue always contains many operations.
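
A minimal sketch contrasting a sync-inducing pattern with one that keeps the loss on the GPU; the toy model and loop body are illustrative assumptions (requires a CUDA device):

    import torch
    from torch import nn

    model = nn.Linear(20, 2).cuda()
    inputs = torch.randn(64, 20, device="cuda")
    targets = torch.randint(0, 2, (64,), device="cuda")

    # Sync-inducing: .item() forces the CPU to wait for the GPU every iteration.
    running_loss = 0.0
    for _ in range(100):
        loss = nn.functional.cross_entropy(model(inputs), targets)
        running_loss += loss.item()          # blocks until the GPU finishes

    # Better: accumulate on the GPU and synchronize once at the end.
    running_loss = torch.zeros((), device="cuda")
    for _ in range(100):
        loss = nn.functional.cross_entropy(model(inputs), targets)
        running_loss += loss.detach()        # stays on the GPU, no sync
    print(running_loss.item())               # single synchronization point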

Ref

  • Faster Deep Learning Training with PyTorch – a 2021 Guide, Lorenz Kuhn
  • Tutorial, Szymon Migacz (NVIDIA)