memo: PyTorch | Tricks for Faster Training


Tips 1–17 come from Faster Deep Learning Training with PyTorch – a 2021 Guide by Lorenz Kuhn.

1. Consider using another learning rate schedule

  • torch.optim.lr_scheduler.CyclicLR
  • torch.optim.lr_scheduler.OneCycleLR
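
A minimal sketch of wiring OneCycleLR into a training loop (the model, data, and hyperparameters are illustrative placeholders, not values from the guide):

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs, steps_per_epoch = 3, 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(32, 10)               # toy batch
        y = torch.randint(0, 2, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                      # OneCycleLR steps once per batch, not once per epoch
```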

2. Use multiple workers and pinned memory in DataLoader

torch.utils.data.DataLoader(train_dataset, batch_size=64, num_workers=4, pin_memory=True) (DataLoader docs)

num_workers

Rule of thumb: set num_workers to four times the number of available GPUs. Note that increasing num_workers also increases RAM consumption.
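
A minimal sketch applying this rule of thumb (the dataset is a toy stand-in so the snippet runs end to end):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset, just for illustration.
train_dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# Rule of thumb: 4 workers per available GPU (falling back to 4 on a CPU-only machine).
num_workers = 4 * max(torch.cuda.device_count(), 1)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=num_workers,
    pin_memory=torch.cuda.is_available(),  # pinned memory only helps when copying to a GPU
)
```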

pin_memory

When using a GPU, it’s better to set pin_memory=True; this instructs the DataLoader to use pinned (page-locked) memory and enables faster, asynchronous memory copies from the host to the GPU. (Tutorial by Szymon Migacz)

pin_memory avoids one implicit CPU-to-CPU copy (from pageable memory to pinned memory) when performing the a.cuda() operation, as the illustration in the NVIDIA blog shows. (I forgot where I originally picked up this idea of a “point-to-point copy”.)

With pinned-memory tensors, the copy a.cuda(non_blocking=True) is asynchronous with respect to the host (CPU). So if the code is structured as:

  1. a.cuda(non_blocking=True) # copy from CPU to GPU
  2. Perform some CPU operations
  3. Perform GPU operations using a.

Steps 1 and 2 can proceed in parallel. Hence, the maximum time that can be saved is the duration of step 2.
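
A minimal sketch of this overlap pattern (GPU-only; the tensor sizes and the CPU work are placeholders):

```python
import torch

assert torch.cuda.is_available()

a = torch.randn(1024, 1024, pin_memory=True)  # host tensor allocated in pinned memory

# Step 1: launch an asynchronous host-to-device copy; control returns to the CPU immediately.
a_gpu = a.cuda(non_blocking=True)

# Step 2: CPU work that does not depend on a_gpu overlaps with the copy.
cpu_result = sum(range(100_000))

# Step 3: GPU operations on a_gpu are queued on the same stream as the copy,
# so they start only after the copy has finished.
b = a_gpu @ a_gpu
torch.cuda.synchronize()
```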

3. Max out the batch size

  • Other hyperparameters, like the learning rate, have to be adjusted. Rule of thumb: double the learning rate when you double the batch size.
  • May cause worse generalization performance.
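
A tiny illustration of that linear-scaling rule of thumb (the baseline values are assumptions for the example, not recommendations):

```python
base_batch_size = 64
base_lr = 1e-3

# Double the batch size -> double the learning rate.
new_batch_size = 256
new_lr = base_lr * (new_batch_size / base_batch_size)  # 4x batch size -> 4e-3
```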

4. Use Automatic Mixed Precision (AMP)

  • Some operations are run in half precision (FP16) rather than single precision (FP32).
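
A minimal AMP training-step sketch using torch.cuda.amp (model, data, and hyperparameters are toy stand-ins; requires a GPU):

```python
import torch

device = "cuda"
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 10, device=device)
y = torch.randint(0, 2, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():           # forward pass in mixed precision
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()             # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```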

5. Consider using another optimizer

  • AdamW often outperforms Adam because it applies decoupled weight decay (rather than L2 regularization).
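
Switching is a one-line change (the lr and weight_decay values here are just illustrative):

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```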

6. Turn on cuDNN benchmarking

(Tutorial by Szymon Migacz)
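
This is a single flag: it lets cuDNN benchmark and cache the fastest convolution algorithms for the input shapes it sees (most useful when input sizes don't change between iterations):

```python
import torch

# Autotune convolution algorithms; set this once before training starts.
torch.backends.cudnn.benchmark = True
```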

7. Avoid unnecessary CPU-GPU synchronizations

Avoid the following operations, each of which forces an implicit CPU-GPU synchronization (Tutorial by Szymon Migacz):

  • tensor.cpu(), tensor.cuda(), tensor.to(device)
  • tensor.item() or tensor.numpy()
  • print(cuda_tensor)
  • cuda_tensor.nonzero(), which retrieves the indices of all non-zero elements
  • Python control flow based on CUDA tensors, e.g., if (cuda_tensor != 0).all()

Good practice is to let the CPU run ahead of the accelerator as much as possible, so that the accelerator's work queue always contains many operations.
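
A sketch of that pattern, using per-step losses as stand-ins: accumulate on the device and synchronize once at the end, instead of calling .item() every iteration:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
losses = [torch.rand((), device=device) for _ in range(100)]  # stand-ins for per-step losses

# Synchronizes every iteration: .item() forces the CPU to wait for the GPU.
slow_total = 0.0
for loss in losses:
    slow_total += loss.item()

# Lets the CPU run ahead: accumulate on the device and synchronize only once.
running = torch.zeros((), device=device)
for loss in losses:
    running += loss
fast_total = running.item()
```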

