Failing to train on multiple GPUs in jupyter notebook #2602

Open
chem1kal1 opened this issue Mar 6, 2024 · 4 comments

chem1kal1 commented Mar 6, 2024

After running SCVI.setup_anndata and creating the model, training with devices != 1 (e.g. devices=2) raises the following error:

RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

To reproduce, load your h5ad data into adata, then run:

import scvi

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["Sample"],
    continuous_covariate_keys=["pct_counts_mt", "total_counts", "pct_counts_ribo"],
)
model = scvi.model.SCVI(adata)
model.train(devices=2)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
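
The error message suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of setting it from Python follows, assuming the variable has to be in place before the first CUDA call for it to take effect. The helper name enable_cuda_debug_env is hypothetical and not part of torch or scvi-tools.

```python
import os


def enable_cuda_debug_env(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in `env` (defaults to os.environ).

    Hypothetical helper for illustration only. Call it before importing
    torch or scvi so the setting is present when CUDA is first used.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_cuda_debug_env()
# Import torch / scvi and run the reproduction only after this point.
```

With this set, kernel launches run synchronously, so the stack trace is more likely to point at the call that actually failed.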

Versions:

scvi-tools '1.1.2'
torch '2.2.1+cu118'

OS is ubuntu 22.04
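
For anyone trying to reproduce this, the version details above can be gathered in one go. This is only a sketch using the standard library; the function name report_versions and the default package list are assumptions chosen for illustration.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages=("scvi-tools", "torch", "lightning")):
    """Return a dict of installed versions for the given distributions.

    Hypothetical helper; packages that are not installed are reported
    as "not installed" instead of raising.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


for name, value in report_versions().items():
    print(f"{name}: {value}")
```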

I have also tested strategy="ddp_notebook_find_unused_parameters_true", and it does not work either. In fact, providing a strategy parameter causes it to fail.

Running on one GPU works fine. I was able to train on two GPUs sometime last year, but I don't remember exactly when.

@chem1kal1 chem1kal1 added the bug label Mar 6, 2024
@martinkim0
Contributor

Hi, sorry you're running into this issue. Did you happen to try running multi-GPU training outside of the notebook? Does it work then or is it the same error?

@chem1kal1
Author

It is the same error when running it in a .py file.

@martinkim0
Contributor

Apologies for the delay. Is there a different error when passing in a strategy parameter? Could you try passing in the DDPStrategy using the spawn start method?
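
A rough sketch of what that suggestion might look like. It assumes that model.train forwards extra keyword arguments such as strategy to the underlying Lightning Trainer, and that the installed Lightning exposes DDPStrategy with a start_method argument. The helper name spawn_ddp_train_kwargs is hypothetical, and find_unused_parameters=True simply mirrors the strategy name tried above. This is untested against the setup in this issue.

```python
def spawn_ddp_train_kwargs(devices=2):
    """Build keyword arguments for model.train() that request DDP with the
    "spawn" start method. Hypothetical helper, not part of scvi-tools."""
    # Imported lazily so the helper can be defined without Lightning installed.
    from lightning.pytorch.strategies import DDPStrategy

    strategy = DDPStrategy(start_method="spawn", find_unused_parameters=True)
    # Assumption: scvi-tools passes these through to the Lightning Trainer.
    return {"accelerator": "gpu", "devices": devices, "strategy": strategy}


# Intended use in a .py script. The spawn start method re-imports the main
# module in each worker, so the training call should sit under an
# `if __name__ == "__main__":` guard:
#
#     model = scvi.model.SCVI(adata)
#     model.train(**spawn_ddp_train_kwargs(devices=2))
```

If this fails with a different message than the CUDA initialization error, posting that traceback would help narrow things down.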

Bo-UT commented Mar 22, 2024

I got exactly the same error here.
