memo: PyTorch | Reproducibility

2022-8-21: Experiments were conducted on the PixelNeRF code, trying to get identical loss curves on every run.

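  1. In train.py, set np.random.seed(0) and torch.manual_seed(0), so that the images and pixels sampled during training, and the objects and views used during validation, are the same on every run;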

  2. In trainer.py, set worker_fn

  3. In nerf.py, set torch.manual_seed(2201) so that the same random numbers are drawn every time; the loss curves still differ by about 0.1 in some places.

  4. train_set is color-augmented through ColorJitterDataset. After adding np.random.seed(0) in data.util.py, the PSNR of the first batch of epoch 0 became identical across runs (10.55053).

  5. Make the operations in models.py deterministic. For pytorch=1.6.0, the Docs list only the following two settings, but they seemed to have no effect.

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
  6. torch.use_deterministic_algorithms() is only available from pytorch=1.8 onward, so:

    • Uninstalled PyTorch with conda uninstall pytorch and reinstalled 1.12.1, which raised:

      pyparsing.exceptions.ParseException: Expected '}', found '='

    • After reinstalling 1.10.2 instead, an error was raised at from torch.utils.tensorboard import SummaryWriter:

      AttributeError: module 'distutils' has no attribute 'version'

      Downgrade setuptools:

      pip uninstall setuptools
      pip install setuptools==59.5.0
      

      After that, pyparsing still raised the argument-parsing error above, and I don't know how to fix it. (pytorch forum)

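    • Deleted the environment and changed the versions in environment.yml to pytorch==1.11.0, torchvision==0.12.0 (the version numbers differ by 1), then recreated the environment: conda env create -f envxx.yml. An error was then raised at torch.matmul():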

      RuntimeError: Deterministic behavior was enabled with either 
      `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, 
      but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. 
      To enable deterministic behavior in this case, 
      you must set an environment variable before running your PyTorch application: 
      CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. 
      For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
      
    • Set the environment variable before running the program: export CUBLAS_WORKSPACE_CONFIG=:4096:8

      Another error was then raised during autograd in loss.backward():

      RuntimeError: grid_sampler_2d_backward_cuda does not have a deterministic implementation, 
      but you set 'torch.use_deterministic_algorithms(True)'. 
      You can turn off determinism just for this operation, 
      or you can use the 'warn_only=True' option, if that's acceptable for your application. 
      You can also file an issue at https://github.com/pytorch/pytorch/issues to 
      help us prioritize adding deterministic support for this operation.
      
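    • Adding the argument torch.use_deterministic_algorithms(True, warn_only=True) lets the training run (with warn_only=False, the default, the error above is still raised). With torch.backends.cudnn.benchmark = False added as well, the results of two runs are still not fully identical except for batch 1, and performance drops a lot. This seems to be caused by the PyTorch version, so I gave up and reinstalled the original environment. References: torch-reproducibility-doc; torch.use_deterministic_algorithms() was introduced in 1.8.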
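
The determinism settings from items 5 and 6 can be gathered in one place. The following is only a minimal sketch, not the setup actually used in the experiments: the helper name enable_determinism is hypothetical, setting CUBLAS_WORKSPACE_CONFIG through os.environ (instead of export in the shell) is an assumption, and the warn_only argument requires pytorch>=1.11. Operations without a deterministic implementation, such as grid_sampler_2d_backward_cuda, then only emit a warning, so bit-identical runs are still not guaranteed.

```python
import os

# cuBLAS reads this variable when its handle is created, so it is set
# before torch is imported. setdefault keeps a value exported in the shell.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch


def enable_determinism(warn_only: bool = True) -> None:
    """Hypothetical helper: turn on the determinism switches from items 5 and 6."""
    # cuDNN: use deterministic convolution algorithms and disable the
    # autotuner, which may select different kernels from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # With warn_only=True, ops lacking a deterministic implementation
    # warn instead of raising RuntimeError.
    torch.use_deterministic_algorithms(True, warn_only=warn_only)


enable_determinism()
```

Calling enable_determinism() once at the top of the training script, before any model or tensor is created, keeps the switches out of the individual modules.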

python train/train.py --name dtu_origin --conf conf/exp/dtu.conf --datadir data/DTU_Dataset/rs_dtu_4 --nviews 3 --gpu_id='0 2' --epochs 400_000
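
The seeding in items 1 to 4 is spread over train.py, trainer.py, nerf.py and data.util.py. As a rough sketch only, it could be collected in a single helper as below. The names seed_everything, seed_worker and make_loader are hypothetical and not part of the PixelNeRF code, the toy TensorDataset stands in for the real dataset, and reading "worker_fn" as the DataLoader's worker_init_fn argument is an assumption.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed Python, NumPy and PyTorch in one call."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA generators


def seed_worker(worker_id: int) -> None:
    """Hypothetical worker_init_fn: reseed NumPy and random in each worker.

    Inside a worker process torch.initial_seed() already differs per worker,
    so NumPy-based augmentation gets a distinct but repeatable stream.
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int = 0, num_workers: int = 0) -> DataLoader:
    """Build a shuffled loader over a toy dataset with a fixed shuffle order."""
    dataset = TensorDataset(torch.arange(16))  # placeholder for the real data
    generator = torch.Generator()
    generator.manual_seed(seed)  # controls the shuffle order
    return DataLoader(
        dataset,
        batch_size=4,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,  # only called when num_workers > 0
        generator=generator,
    )
```

Two loaders built with the same seed yield their batches in the same order, which is the behaviour items 1 and 2 aim for. It does not by itself remove the remaining differences caused by non-deterministic CUDA kernels.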
