
Test NexusGS


NexusGS

Environment

➀ Replace environment.yml

  1. Problems:

    1. NexusGS borrowed FSGS’s environment.yml for conda

      • The Python and PyTorch versions are mismatched

      • Replace environment.yml after git clone while building the Docker image

  2. Supports:

    1. Correct the name and Python version in environment.yml

      name: nexus
      dependencies:
        - python=3.10
        - pip:
          - torch==2.0.0 --index-url https://download.pytorch.org/whl/cu118
          - torchvision==0.15.1 --index-url https://download.pytorch.org/whl/cu118
          - torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
          - numpy<2
      
    2. Copy local environment.yml to the Docker image

      COPY environment.yml .
      

      Note: The local environment.yml is stored alongside the Dockerfile.

➁ Create requirements.txt

  1. Problems:

    1. Convert the environment.yml to a requirements.txt for pip.

      Then, a uv environment can be created on the host machine for debugging.

  2. Supports:

    1. requirements.txt doesn’t include the Python version

    2. requirements.txt doesn’t include cudatoolkit, because pip does not manage the CUDA installation. r1-Gemini

      In other words, pip requires CUDA 11.8 to be installed on the host system for debugging.
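As a sketch, the pip section of the environment.yml above can be flattened into a requirements.txt on the host. The exact file contents below are an assumption based on the yml shown earlier, and the uv commands in the trailing comment are illustrative:

```shell
# Hypothetical requirements.txt derived from the environment.yml shown earlier;
# the Python version and cudatoolkit are deliberately omitted, since pip
# manages neither of them.
cat > requirements.txt <<'EOF'
--index-url https://download.pytorch.org/whl/cu118
torch==2.0.0
torchvision==0.15.1
torchaudio==2.0.1
numpy<2
EOF

# A uv environment for debugging could then be created on the host, e.g.:
#   uv venv --python 3.10 && uv pip install -r requirements.txt
```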


  3. Actions:

    1. I’ll figure out how to debug inside the Docker container.

➂ Dockerfile Builds Image

  1. Problems:

    1. Create a running environment for NexusGS

  2. Supports:

    1. CUDA version limitation

      The system can only have one active CUDA installation at a time. Currently, CUDA 11.3 is installed, but it doesn’t support PyTorch 2.0, which is used by NexusGS.

      I don’t want to install another CUDA version, as it’s time-consuming.

    2. Containerized build

      Build a docker container that includes the specific CUDA version.

    3. Driver-CUDA relationship

      The nvidia-driver on the host machine determines the highest CUDA version supported; CUDA-enabled Docker images must be compatible with this driver version.



  3. Actions:

    1. Create a Dockerfile r1-Gemini

      • TODO: Migrate the source code to GitLab

      • Download source code to HDD for reading

        cd /mnt/Seagate4T/04-Projects
        git clone https://github.com/USMizuki/NexusGS.git
        
    2. Build image

      docker build -t nexusgs:latest /home/zichen/Projects/NexusGS
      
    3. Run container

      docker run -it --rm --gpus all \
        -v /path/to/your/datasets:/workspace/datasets \
        -v /mnt/Seagate4T/04-Projects/NexusGS:/workspace/outputs \
        nexusgs:latest
      



➃ Pip Build Fail

  1. Problems:

    1. Pip failed to build submodules/diff-gaussian-rasterization-confidence

      Traceback {{{
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 510, in _build_extensions_serial
      2.889           self.build_extension(ext)
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
      2.889           _build_ext.build_extension(self, ext)
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 565, in build_extension
      2.889           objects = self.compiler.compile(
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/setuptools/_distutils/compilers/C/base.py", line 655, in compile
      2.889           self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 581, in unix_wrap_single_compile
      2.889           cflags = unix_cuda_flags(cflags)
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 548, in unix_cuda_flags
      2.889           cflags + _get_cuda_arch_flags(cflags))
      2.889         File "/opt/conda/envs/nexus/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1773, in _get_cuda_arch_flags 
      2.889           arch_list[-1] += '+PTX'
      2.889       IndexError: list index out of range
      2.889       [end of output]
      2.889   
      2.889   note: This error originates from a subprocess, and is likely not a problem with pip.
      2.889   ERROR: Failed building wheel for diff_gaussian_rasterization
      2.889   Running setup.py clean for diff_gaussian_rasterization
      4.088 Failed to build diff_gaussian_rasterization
      4.226 error: failed-wheel-build-for-install
      4.226 
      4.226 × Failed to build installable wheels for some pyproject.toml based projects
      4.226 ╰─> diff_gaussian_rasterization
      
      --------------------
      ERROR: failed to build: failed to solve: process "/bin/sh -c pip install submodules/diff-gaussian-rasterization-confidence" did not complete successfully: exit code: 1
      

      }}}

  2. Supports:

    1. Docker builds images on the CPU, with no CUDA devices visible. Therefore, PyTorch cannot detect the GPU’s compute capability. r1-Gemini



  3. Actions:

    1. Specify target CUDA compute capability

      Set the TORCH_CUDA_ARCH_LIST environment variable to tell the compiler which CUDA architecture to build for.

      RUN TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0" pip install submodules/diff-gaussian-rasterization-confidence
      RUN TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0" pip install submodules/simple-knn 
      

➄ Copy Dataset Failed

  1. Problems:

    1. I don’t want to use the original dataset stored on my hard drive directly in the program, because I’m concerned it might be modified.

      Therefore, I prefer to copy the dataset into the Docker container instead.

    • TL;DR: Copying the data is unnecessary.

  2. Supports:

    1. Use the Docker COPY command

      # Copy the local 'datasets' folder into the container's workspace
      COPY ./datasets /workspace/datasets
      
      The workspace layout in the container:

      NexusGS/
      ├── Dockerfile
      ├── datasets/      <-- Your datasets go here (e.g., LLFF folder)
      ├── scripts/
      └── ... (other project files)
      
      • COPY will result in a bigger image.

    2. A symbolic link is required

      • Data outside the build context, i.e. the current folder (.), is not accessible to the Docker daemon.

      • Create a symbolic link that points to the actual data

        ln -s /mnt/Seagate4T/05-DataBank/nerf_llff_data ./LLFF
        



  3. Actions:

    1. Modify the Dockerfile

      COPY ./LLFF/ /workspace/datasets/LLFF
      
      • Note: COPY ./LLFF /workspace/datasets/LLFF (without the trailing slash) behaves differently.

        Docker processes the symbolic link itself, instead of the contents of the LLFF folder.

    2. Rebuild the image

      docker build -t nexusgs:latest .
      
    3. Run the container without mounting dataset

      docker run -it --rm --gpus all \
        -v /mnt/Seagate4T/04-Projects/NexusGS/outputs:/workspace/outputs \
        nexusgs:latest
      

  4. Results:

    1. COPY data from a symbolic link is not allowed

      • /LLFF is not found, even though it is not excluded by .dockerignore.

      Error message {{{
      => ERROR [ 8/10] COPY ./LLFF/ /workspace/datasets/LLFF
      0.0s
      ------
       > [ 8/10] COPY ./LLFF/ /workspace/datasets/LLFF:
      ------
      
      Dockerfile:43
      --------------------
        41 |     
        42 |     # Copy the host 'nerf_llff_data' folder into the container's workspace
        43 | >>> COPY ./LLFF/ /workspace/datasets/LLFF
        44 |     
        45 |     # Install the custom submodules
      --------------------
      ERROR: failed to build: failed to solve: failed to compute cache key: failed to calculate checksum of ref da83f08b-6e43-4168-8960-34c6cc4c07ee::i715mm83err1nsdosaei0t2tl: "/LLFF": not found
      
      (base) zichen@zichen-X570-AORUS-PRO-WIFI:~/Projects/NexusGS$ ls -al
      total 16
      drwxrwxr-x 3 zichen zichen 4096 Oct  2 21:31 .
      drwxrwxr-x 6 zichen zichen 4096 Oct  1 21:38 ..
      -rw-rw-r-- 1 zichen zichen 1782 Oct  2 21:31 Dockerfile
      drwxrwxr-x 8 zichen zichen 4096 Oct  2 13:35 .git
      lrwxrwxrwx 1 zichen zichen   41 Oct  2 21:27 LLFF -> /mnt/Seagate4T/05-DataBank/nerf_llff_data
      

      }}}
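The failure can be reproduced without Docker. The client archives the build context with tar before sending it to the daemon, and tar stores a symlink as a link entry rather than the contents of its target, so a symlinked LLFF contributes no files. A minimal sketch with hypothetical /tmp paths:

```shell
# Simulate a build context directory that contains a symlink to external data.
mkdir -p /tmp/ctx /tmp/data
echo hello > /tmp/data/file
ln -sfn /tmp/data /tmp/ctx/LLFF

# Archive the "context" the way the Docker client does.
tar -C /tmp/ctx -cf /tmp/ctx.tar .

# The listing shows a symlink entry only; /tmp/data/file is not archived.
tar -tvf /tmp/ctx.tar | grep LLFF
```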


➅ Dataset Read-Only

  1. Problems:

    1. Set the volume to read-only to prevent it from being modified r1-Gemini



  2. Supports:

    1. Append :ro to the end of the volume definition, to make the mounted directory read-only inside the container

      -v ./datasets:/workspace/datasets:ro
      

  3. Actions:

    1. Remove the COPY command from the Dockerfile

    2. Rebuild the image

      docker build -t nexusgs:latest .
      
    3. Run container

      docker run -it --rm --gpus all \
        -v /mnt/Seagate4T/05-DataBank/nerf_llff_data:/workspace/datasets/LLFF:ro \
        -v /mnt/Seagate4T/04-Projects/NexusGS/output:/workspace/output \
        nexusgs:latest
      

➆ Run LLFF fern

  1. Problems:

    1. Run the example case of LLFF fern

  2. Supports:

    1. NexusGS requires optical flow data: llff_flow

      ├── dataset
          ├── nerf_llff_data
              ├── fern
                  ├── sparse
                  ├── images 
                  ├── images_8
                  ├── 3_views   <-- Copy from llff_flow
                      ├── flow  
      

  3. Actions:

    1. Mount flow data for each scene

      Run container:

      docker run -it --rm --gpus all \
        -v /mnt/Seagate4T/05-DataBank/nerf_llff_data:/workspace/dataset/nerf_llff_data:ro \
        -v /mnt/Seagate4T/05-DataBank/llff_flow/fern/3_views:/workspace/dataset/nerf_llff_data/fern/3_views \
        -v /mnt/Seagate4T/04-Projects/NexusGS/output:/workspace/output \
        nexusgs:latest
      
    2. Execute shell script

      The non-HuggingFace script runs train.py, render.py, and metrics.py

      sh scripts/run_llff.sh 0
      

  4. Results:

    1. Output

      Log {{{
      root@d8ffb5435336:/workspace# sh scripts/run_llff.sh 0
      
      [5000, 10000, 30000]
      Optimizing output/llff/fern/3_views
      Output folder: output/llff/fern/3_views [03/10 17:59:57]
      Reading camera 20/20 [03/10 17:59:58]
      2.8834194898605348 cameras_extent [03/10 17:59:58]
      Loading Training Cameras [03/10 17:59:58]
      3it [00:00,  5.73it/s]
      Loading Test Cameras [03/10 17:59:58]
      3it [00:00, 198.50it/s]
      Loading Eval Cameras [03/10 17:59:58]
      14it [00:00, 159.13it/s]
      /opt/conda/envs/nexus/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be requ
      ired to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
        return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
      Number of points at initialisation :  538504 [03/10 17:59:58]
      Number of points at initialisation :  538504 [03/10 17:59:58]
      Training progress:  17%|████████▋                                           | 5000/30000 [01:12<06:02, 68.88it/s, Loss=0.0012369, Points=525819]
      Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /root/.cache/torch/hub/checkpoints/vgg16-397923af.pth
      100%|███████████████████████████████████████████████████████████████████████| 528M/528M [05:42<00:00, 1.62MB/s]
      Downloading: "https://raw.githubusercontent.com/richzhang/PerceptualSimilarity/master/lpips/weights/v0.1/vgg.pth" to /root/.cache/torch/hub/checkpoints/vgg.pth
      100%|███████████████████████████████████████████████████████████████████████| 7.12k/7.12k [00:00<00:00, 15.2MB/s]
      0%|                                                                         | 0.00/7.12k [00:00<?, ?B/s]
      [ITER 5000] Evaluating test: L1 0.05085379630327225 PSNR 21.436749140421547 SSIM 0.701229194800059 LPIPS 0.20546899239222208  [03/10 18:06:57]
      
      [ITER 5000] Evaluating train: L1 0.0012755183658252158 PSNR 52.29058202107747 SSIM 0.9994754989941914 LPIPS 0.0005458221421577036  [03/10 18:07:01]
      Training progress:  33%|█████████████████                                  | 10000/30000 [08:20<05:07, 64.95it/s, Loss=0.0008756, Points=501851]
      [ITER 10000] Evaluating test: L1 0.048601570228735604 PSNR 21.6707280476888 SSIM 0.7072837154070536 LPIPS 0.2020971179008484  [03/10 18:08:22]
      
      [ITER 10000] Evaluating train: L1 0.0009691654122434556 PSNR 55.08801142374674 SSIM 0.9997365872065226 LPIPS 0.00024261641374323517  [03/10 18:08:25]
      
      Training progress: 100%|███████████████████████████████████████████████████| 30000/30000 [14:04<00:00, 35.51it/s, Loss=0.0007041, Points=447917]
      [ITER 30000] Evaluating test: L1 0.04780491938193639 PSNR 21.859390894571938 SSIM 0.7095310091972351 LPIPS 0.201074277361234  [03/10 18:14:07]
      [ITER 30000] Evaluating train: L1 0.000792427861597389 PSNR 56.9059575398763 SSIM 0.9998162388801575 LPIPS 0.00017058776090076813  [03/10 18:14:10]
      
      [ITER 30000] Saving Gaussians [03/10 18:14:10]
      
      Training complete. [03/10 18:14:12]
      Looking for config file in output/llff/fern/3_views/cfg_args
      Config file found: output/llff/fern/3_views/cfg_args
      Rendering output/llff/fern/3_views
      Loading trained model at iteration 30000 [03/10 18:14:15]
      Reading camera 20/20 [03/10 18:14:15]
      2.8834194898605348 cameras_extent [03/10 18:14:15]
      Loading Training Cameras [03/10 18:14:15]
      3it [00:01,  2.91it/s]
      Loading Test Cameras [03/10 18:14:16]
      3it [00:00, 181.28it/s]
      Loading Eval Cameras [03/10 18:14:16]
      14it [00:00, 221.72it/s]
      Rendering progress: 100%|█████████████████████████████████| 3/3 [00:00<00:00,  7.33it/s]
      Rendering progress: 100%|█████████████████████████████████| 3/3 [00:00<00:00,  7.43it/s]
      
      Scene: output/llff/fern/3_views
      Method: ours_30000
      Metric evaluation progress: 100%|█████████████████████████| 3/3 [00:03<00:00,  1.22s/it]
        SSIM :    0.7092038
        PSNR :   21.8406734
        LPIPS:    0.2013377
      

      }}}


Code Understanding

➀ DeepWiki

  1. Problems:

    1. Previously, I studied code through step-by-step debugging. But writing a VSCode debug config file still takes some time.

      I recently noticed that DeepWiki can give a detailed explanation of a repo.


  2. Supports:

    1. DeepWiki misreads the “Environmental Setups” section, r1-DW but its analysis of the logical links between code files is still a useful reference and helps to quickly understand the project structure.



➁ NotebookLM

  1. Problems:

    1. I get drowsy reading unfamiliar English documents, so I’d rather listen to audio or watch videos.

      I know NotebookLM can generate audio to aid learning.


  2. Supports:

    1. When a URL is inserted as a source, only that single webpage is included, not all content on the site

    2. It generates flashcards for self-testing.



➂ Read Aloud

  1. Problems:

    1. I fall asleep reading webpages; I need a tool that reads them aloud for me

  2. Supports:

    1. AI Text Reader: Read Long Text Aloud Online, No Sign-Up - notegpt.io
      Found by searching “webpage ai read aloud” on DDG

    2. Edge browser has a built-in Read Aloud function.

      • It can jump to where I clicked.

➃ Debug Step-by-Step

  1. Problems:

    1. Use VSCode to debug the code with the llff fern dataset

  2. Supports:

    1. Hyperparameters in run_llff.sh

      python train.py --source_path dataset/nerf_llff_data/fern \
        --model_path output/llff/fern/3_views \
        --eval --n_views 3 \
        --save_iterations  30000 \
        --iterations 30000 \
        --densify_until_iter 30000 \
        --position_lr_max_steps 30000 \
        --dataset_type llff \
        --images images_8 \
        --split_num 4 \
        --valid_dis_threshold 1.0 \
        --drop_rate 1.0 \
        --near_n 2 \
      

  3. Actions:

    1. Create a launch.json file for debugging

      Python Debugger –> Python File with Arguments
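For reference, a launch.json built from the run_llff.sh hyperparameters above might look like this (a sketch, not the exact file I used; the configuration name is arbitrary, and "type": "debugpy" assumes the current Python Debugger extension):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "NexusGS: train fern (3 views)",
      "type": "debugpy",
      "request": "launch",
      "program": "train.py",
      "console": "integratedTerminal",
      "args": [
        "--source_path", "dataset/nerf_llff_data/fern",
        "--model_path", "output/llff/fern/3_views",
        "--eval", "--n_views", "3",
        "--save_iterations", "30000",
        "--iterations", "30000",
        "--densify_until_iter", "30000",
        "--position_lr_max_steps", "30000",
        "--dataset_type", "llff",
        "--images", "images_8",
        "--split_num", "4",
        "--valid_dis_threshold", "1.0",
        "--drop_rate", "1.0",
        "--near_n", "2"
      ]
    }
  ]
}
```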


➄ Debug Inside Container

  1. Problems:

    1. I don’t want to install CUDA on the host machine.

      How to debug a Python program within a Docker container?

      PDB? GDB? or VSCode headless?


  2. Supports:

    1. Use debugpy
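One possible setup (a sketch, not yet verified against this repo): install debugpy in the image, start the container with the debug port published (e.g. add -p 5678:5678 to the docker run command), and launch training inside the container with python -m debugpy --listen 0.0.0.0:5678 --wait-for-client train.py plus its usual arguments. VSCode on the host then attaches with a launch.json entry like this (the port number and path mapping are assumptions):

```json
{
  "name": "Attach to NexusGS container",
  "type": "debugpy",
  "request": "attach",
  "connect": { "host": "localhost", "port": 5678 },
  "pathMappings": [
    { "localRoot": "${workspaceFolder}", "remoteRoot": "/workspace" }
  ]
}
```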



Eval on DTU

➀ Ask DeepWiki

  1. Problems:

    1. How do I prepare the dataset as a DTU data_type?



➁ Export Point Cloud

  1. Problems:

    1. Dataflow

      DTU 3 images → Colmap Points → Add to Optimizer → 3dgs Render → Evaluate Comp. & Acc.

  2. Supports:

    1. The Colmap dataset type is determined by the existence of a sparse directory

      Sources: scene/__init__.py, line #53
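The idea behind that check can be sketched as a small predicate (a paraphrase of the logic, not the actual code in scene/__init__.py; the paths are illustrative):

```shell
# A "sparse" subdirectory marks a Colmap-style dataset (sketch of the
# dispatch in scene/__init__.py).
detect_scene_type() {
  if [ -d "$1/sparse" ]; then
    echo colmap
  else
    echo unknown
  fi
}

mkdir -p /tmp/scene_demo/fern/sparse
detect_scene_type /tmp/scene_demo/fern   # prints: colmap
```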
