The given group does not exist pytorch #379

Open
germanjke opened this issue Apr 25, 2023 · 2 comments

Comments

@germanjke

Do you know why I get this error when running pretrain_gpt_single_node.sh?
I set N_GPUS=1 and got:

File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank
    raise RuntimeError("The given group does not exist")
RuntimeError: The given group does not exist

The call comes from:

Megatron-DeepSpeed/megatron/training.py", line 400, in setup_model_and_optimizer
    model = get_model(model_provider_func)

I'm using the NGC Docker image with PyTorch and Apex, with DeepSpeed and the other packages installed from your requirements.txt.

My setup is 2x 3090.

germanjke changed the title from "The given group does not exist" to "The given group does not exist pytorch" on Apr 25, 2023

LYF915 commented May 25, 2023

I also encountered this problem. Did you solve it?

zql022 commented Oct 24, 2023

Me too. How did you solve this problem?
