Tips 1-17 come from *Faster Deep Learning Training with PyTorch – a 2021 Guide* by Lorenz Kuhn.
1. Consider using another learning rate schedule
For example, `torch.optim.lr_scheduler.CyclicLR` and `torch.optim.lr_scheduler.OneCycleLR`.
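A minimal sketch of wiring in `OneCycleLR` (the model, data, and hyperparameter values below are placeholders, not recommendations):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# OneCycleLR ramps the LR up to max_lr and anneals it back down
# over the whole run (total_steps batches).
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=1000)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).sum()  # dummy forward pass
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped after every batch, not every epoch
```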
2. Use multiple workers and pinned memory in DataLoader
```python
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, num_workers=4, pin_memory=True)
```
See the `DataLoader` docs for the full argument list.
`num_workers`
Rule of thumb: set `num_workers` to four times the number of available GPUs.
Note that increasing `num_workers` also increases RAM consumption.
`pin_memory`
When using a GPU, it's better to set `pin_memory=True`: this instructs the DataLoader to use pinned (page-locked) memory, which enables faster, asynchronous memory copies from the host to the GPU (see Szymon Migacz's tutorial).
Pinned memory avoids one implicit CPU-to-CPU copy (from pageable memory to pinned memory) when performing an `a.cuda()` operation, as the illustration in the NVIDIA blog shows.
I forgot where I got this inspiration: "point-to-point copy".
With pinned-memory tensors, the copy `a.cuda(non_blocking=True)` is asynchronous with respect to the host (CPU). So if the code is structured as:
1. `a.cuda(non_blocking=True)` — copy `a` from CPU to GPU
2. Perform some CPU operations
3. Perform GPU operations using `a`

then steps 1 and 2 can proceed in parallel, so the maximum time saved is the duration of step 2. A minimal sketch follows.
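A sketch of that ordering, assuming a CUDA device is available (the CPU and GPU work here are stand-ins for whatever your loop actually does):

```python
import torch

a = torch.randn(1024, 1024).pin_memory()  # page-locked host tensor

a_gpu = a.cuda(non_blocking=True)  # step 1: async H2D copy, returns immediately

cpu_result = sum(range(10_000))    # step 2: CPU work overlaps with the copy

b = a_gpu * 2                      # step 3: GPU ops, queued after the copy
                                   # on the same stream, so ordering is safe
torch.cuda.synchronize()           # wait for all GPU work to finish
```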
3. Max out the batch size
- Other hyperparameters, such as the learning rate, have to be adjusted. Rule of thumb: when you double the batch size, double the learning rate (see the sketch after this list).
- Larger batches may cause worse generalization performance.
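A sketch of that linear-scaling rule of thumb (the base values are made up for illustration):

```python
base_batch_size = 64
base_lr = 0.1

batch_size = 256  # maxed-out batch size that still fits in GPU memory
# Linear scaling rule: scale the LR proportionally to the batch size
lr = base_lr * batch_size / base_batch_size  # -> 0.4
```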
4. Use Automatic Mixed Precision (AMP)
- Some operations are run in half precision (FP16) rather than single precision (FP32), which is faster and uses less memory; a sketch follows below.
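A minimal AMP training-step sketch using `torch.cuda.amp` (the model, optimizer, and data are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for _ in range(100):
    x = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # eligible ops run in FP16
        loss = model(x).sum()
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales grads, then optimizer.step()
    scaler.update()
```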
5. Consider using another optimizer
- AdamW often outperforms Adam because it applies decoupled weight decay rather than L2 regularization.
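Swapping it in is a one-line change (the values here are illustrative):

```python
import torch

model = torch.nn.Linear(10, 2)
# AdamW decouples weight decay from the gradient update,
# instead of folding it in as L2 regularization like Adam does.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```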
6. Turn on cuDNN benchmarking
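Enabling it is a single line; it only pays off when input shapes stay constant:

```python
import torch

# Let cuDNN benchmark several convolution algorithms and cache the fastest.
# If input shapes vary between batches, re-benchmarking on every new shape
# can actually slow things down.
torch.backends.cudnn.benchmark = True
```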
7. Avoid unnecessary CPU-GPU synchronizations
Operations that force an implicit synchronization include:

- `tensor.cpu()`, `tensor.cuda()`, and `tensor.to()`
- `tensor.item()` and `tensor.numpy()`
- `print(cuda_tensor)`
- `cuda_tensor.nonzero()`, which retrieves the indices of all non-zero elements
- Python control flow based on CUDA tensors, e.g. `if (cuda_tensor != 0).all()`
Good practice is to let the CPU run ahead of the accelerator as much as possible, so that the accelerator's work queue always contains many operations. One common fix is sketched below.
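This sketch keeps the running loss on the GPU and synchronizes only once at the end, instead of calling `.item()` every iteration (the model and data are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Accumulate on-device; avoids a CPU-GPU sync on every iteration.
running_loss = torch.zeros((), device="cuda")

for _ in range(100):
    x = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    running_loss += loss.detach()  # stays on the GPU, no sync

print(running_loss.item() / 100)   # single sync at the very end
```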