Failing to train on multiple GPUs in jupyter notebook #2602

chem1kal1 · 2024-03-06T04:03:29Z

After running SCVI.setup_anndata and creating the model, training with devices != 1 (e.g. devices=2) raises an error:

RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

To reproduce, load your h5ad data into adata, then run:

import scvi  # adata: AnnData loaded from your h5ad file

scvi.model.SCVI.setup_anndata(adata, layer="counts", categorical_covariate_keys=["Sample"],
                              continuous_covariate_keys=["pct_counts_mt", "total_counts", "pct_counts_ribo"])
model = scvi.model.SCVI(adata)
model.train(devices=2)  # raises the RuntimeError

RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
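
The error message suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of setting it from Python follows, assuming the variable has to be in the environment before CUDA is initialized in the process, so it is set before torch or scvi are imported. Child processes started for multi-GPU training normally inherit the parent's environment.

```python
import os

# Set this before importing torch/scvi so it is already in the environment
# when CUDA is first initialized. Kernel launches then run synchronously,
# which makes the reported stack trace point closer to the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import scvi  # import the CUDA-using libraries only after the line above
```

Setting the variable in the shell instead (CUDA_LAUNCH_BLOCKING=1 python script.py) has the same effect for a .py script.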

Versions:

scvi-tools '1.1.2'
torch '2.2.1+cu118'

OS is Ubuntu 22.04

I have also tested the ddp_notebook_find_unused_parameters_true strategy, and it does not work either.
In fact, providing a strategy parameter causes it to fail.

Running on 1 GPU works fine. I was able to run on two GPUs sometime last year, but I don't remember exactly when.


martinkim0 · 2024-03-06T18:25:14Z

Hi, sorry you're running into this issue. Did you happen to try running multi-GPU training outside of the notebook? Does it work then or is it the same error?

chem1kal1 · 2024-03-06T19:03:07Z

It is the same error when running it in a .py file.

martinkim0 · 2024-03-19T19:44:09Z

Apologies for the delay. Is there a different error when passing in a strategy parameter? Could you try passing in the DDPStrategy using the spawn start method?
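
For reference, here is a minimal sketch of what this suggestion could look like in a .py script. It assumes that model.train forwards a strategy keyword to the Lightning Trainer (the report above passes a strategy to train) and that the installed scvi-tools uses the lightning package, whose DDPStrategy accepts a start_method argument. The helper name train_with_spawn_ddp and its strategy_factory parameter are hypothetical, introduced only for illustration.

```python
def train_with_spawn_ddp(model, devices=2, strategy_factory=None):
    """Call model.train with a DDP strategy that starts workers via 'spawn'.

    strategy_factory is a hypothetical hook so the strategy can be swapped;
    by default it builds Lightning's DDPStrategy with start_method="spawn".
    """
    if strategy_factory is None:
        # Imported lazily so the helper can be defined without lightning installed.
        from lightning.pytorch.strategies import DDPStrategy

        def strategy_factory():
            return DDPStrategy(start_method="spawn")

    # Assumption: extra keyword arguments of train() reach the Lightning Trainer.
    return model.train(devices=devices, strategy=strategy_factory())


# Usage in a script (the 'spawn' start method re-imports the main module in
# each worker, so the entry point needs the __main__ guard):
#
# if __name__ == "__main__":
#     model = scvi.model.SCVI(adata)
#     train_with_spawn_ddp(model, devices=2)
```

With the spawn start method, worker processes start from a fresh interpreter rather than a forked copy of the parent, which is why the guard around the entry point matters.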

Bo-UT · 2024-03-22T18:22:11Z

I got exactly the same error here.

chem1kal1 added the bug label Mar 6, 2024
