Environment conflicts with GPU #19

Open · anar-rzayev opened this issue Jan 18, 2024 · 3 comments

anar-rzayev commented Jan 18, 2024

Hi, thanks a lot for your responsiveness on these issues. I wanted to ask about the following error, which occurs when I try to train Stage 1:

24-01-18 01:06:41.203 - INFO: [Phase 1] Training noise model!
24-01-18 01:07:04.744 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: Initial Dataset Finished
/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: 
NVIDIA RTX 6000 Ada Generation with CUDA capability sm_89 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA RTX 6000 Ada Generation GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
24-01-18 01:07:23.542 - INFO: Noise Model is created.
24-01-18 01:07:23.542 - INFO: Initial Model Finished
1.8.0 10.2
export CUDA_VISIBLE_DEVICES=2
Loaded data of size: (118, 118, 25, 56)
Loaded data of size: (118, 118, 25, 56)
dropout 0.0 encoder dropout 0.0
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
[the two lines above repeat 32 times in total]
Traceback (most recent call last):
  File "train_noise_model.py", line 72, in <module>
    trainer.optimize_parameters()
  File "/home/anar/DDM2/model/model_stage1.py", line 62, in optimize_parameters
    outputs = self.netG(self.data)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 44, in forward
    return self.p_losses(x, *args, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 36, in p_losses
    x_recon = self.denoise_fn(x_in['condition'])
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/unet.py", line 286, in forward
    x = layer(x)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device

Previously, when I was trying to denoise HARDI150 volumes, I didn't pin any PyTorch version and used Python >= 3.10. After noticing the versions pinned in your environment.yaml, I switched to the exact torch, torchvision, and Python versions it specifies, but that is when the error above started. Do you think it is better to leave the PyTorch version unpinned, or should the versions match the file exactly?

I ask because in my previous issue, where the validation loader was not working, I suspected version mismatches with the environment file were the cause; after hitting the problem above, I am still unsure about that as well.
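
For reference, the mismatch can be verified directly from Python. Here is a minimal check (my own sketch, not code from this repo) comparing the GPU's compute capability against the kernel architectures compiled into the installed PyTorch build:

import torch

# Build version and the CUDA toolkit it was compiled against
print(torch.__version__, torch.version.cuda)

# Kernel architectures baked into this build, e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
print(torch.cuda.get_arch_list())

# Compute capability of the visible GPU; an RTX 6000 Ada reports (8, 9) -> sm_89
major, minor = torch.cuda.get_device_capability(0)
print(f"device capability: sm_{major}{minor}")

The "no kernel image is available" RuntimeError is exactly what happens when the device's sm_XY is missing from get_arch_list().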

anar-rzayev (Author)

@tiangexiang Any ideas on this?

tiangexiang (Collaborator)

Sorry for the late response! The error you reported specifically indicates a mismatch between the PyTorch version and the CUDA version, and you are right that the validation loader failure is probably due to a version mismatch as well. I therefore recommend duplicating the exact environment specified in environment.yaml, since it is guaranteed to work (be careful with the CUDA version, though: it has to match your hardware).
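
One detail worth double-checking (an observation about pip wheel selection, not repo-specific advice): a plain pip install of torch==1.8.0 pulls the default CUDA 10.2 wheel, which matches the "1.8.0 10.2" printed in the first log, and a conda cudatoolkit pin does not change what a pip wheel was compiled against. For an Ada-generation GPU (sm_89), a wheel that actually targets the hardware is needed, for example:

# CUDA 11.3 wheels ship sm_86 kernels, which sm_89 GPUs can generally run
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# or a newer build compiled directly for sm_89
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

Releases newer than the one pinned in environment.yaml may require small code adjustments, so treat these as starting points.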

anar-rzayev (Author) commented Feb 2, 2024

@tiangexiang Thanks for the reply. I checked very carefully, and to match my hardware I set cudatoolkit=11.3 with the corresponding PyTorch versions as follows:

name: ddm2_experiment
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _pytorch_select=0.1
  - blas=1.0
  - ca-certificates=2022.3.29
  - certifi=2021.10.8
  - cudatoolkit=11.3
  - freetype=2.11.0
  - giflib=5.2.1
  - intel-openmp=2021.4.0
  - jpeg=9d
  - lcms2=2.12
  - ld_impl_linux-64=2.35.1
  - libffi=3.3
  - libgcc-ng=9.3.0
  - libgomp=9.3.0
  - libpng=1.6.37
  - libstdcxx-ng=9.3.0
  - libtiff=4.2.0
  - libuv=1.40.0
  - libwebp=1.2.2
  - libwebp-base=1.2.2
  - lz4-c=1.9.3
  - mkl=2021.4.0
  - mkl-service=2.4.0
  - mkl_fft=1.3.1
  - mkl_random=1.2.2
  - ncurses=6.3
  - ninja=1.10.2
  - openssl=1.1.1n
  - pip=21.2.4
  - python=3.8.13
  - readline=8.1.2
  - setuptools=58.0.4
  - six=1.16.0
  - sqlite=3.38.2
  - tk=8.6.11
  - typing_extensions=4.1.1
  - wheel=0.37.1
  - xz=5.2.5
  - zlib=1.2.11
  - zstd=1.4.9
  - pip:
    - beautifulsoup4==4.11.1
    - charset-normalizer==2.0.12
    - cycler==0.11.0
    - dipy==1.5.0
    - filelock==3.6.0
    - fonttools==4.31.2
    - gdown==4.4.0
    - h5py==3.6.0
    - idna==3.3
    - imageio==2.16.1
    - joblib==1.1.0
    - kiwisolver==1.4.2
    - matplotlib==3.5.1
    - networkx==2.7.1
    - nibabel==3.2.2
    - numpy==1.22.3
    - opencv-python==4.5.4.58
    - packaging==21.3
    - pandas==1.4.1
    - pillow==9.1.0
    - pydicom==2.3.0
    - pyparsing==3.0.7
    - pysocks==1.7.1
    - python-dateutil==2.8.2
    - pytz==2022.1
    - pywavelets==1.3.0
    - pyyaml==6.0
    - requests==2.27.1
    - scikit-image==0.19.2
    - scikit-learn==1.0.2
    - scipy==1.8.0
    - seaborn==0.11.2
    - soupsieve==2.3.2.post1
    - statannot==0.2.3
    - threadpoolctl==3.1.0
    - tifffile==2022.3.25
    - timm==0.4.12
    - torch==1.8.0
    - torchvision==0.9.0
    - tqdm==4.63.1
    - urllib3==1.26.9

Even with the versions matched exactly, I still had problems with the validation part of training.

Validation
Traceback (most recent call last):
  File "train_noise_model.py", line 92, in <module>
    for _,  val_data in enumerate(val_loader):
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anar/DDM2/data/mri_dataset.py", line 130, in __getitem__
    raw_input = raw_input[:,:,0]
IndexError: index 0 is out of bounds for axis 2 with size 0

Even trying the latest versions of torch and torchvision did not help at all 🙁
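
For completeness, this IndexError can be reproduced without PyTorch at all: the slice raw_input[:,:,0] only fails when axis 2 is empty, which points at a validation volume/slice index falling outside the loaded data rather than at library versions. A minimal reproduction (shapes taken from the log above; variable names are illustrative, not from mri_dataset.py):

import numpy as np

# Same shape as "Loaded data of size: (118, 118, 25, 56)"
volume = np.zeros((118, 118, 25, 56))

slice_idx = 25  # out of range: valid slice indices are 0..24
raw_input = volume[:, :, slice_idx:slice_idx + 1, :3]
print(raw_input.shape)  # (118, 118, 0, 3) -- axis 2 is empty

raw_input = raw_input[:, :, 0]  # IndexError: index 0 is out of bounds for axis 2 with size 0

If that is what is happening here, comparing the validation slice range in the config against the shape printed at load time should surface the problem before the DataLoader worker dies.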
