RuntimeError: Timed out initializing process group in store based barrier on rank 2 #3626

Open
SingL3 opened this issue Aug 2, 2023 · 2 comments
@SingL3
Contributor

SingL3 commented Aug 2, 2023

I am trying to run pretraining of LLaMA 30B. Here is the command I am running:

deepspeed trainer_sft.py --configs defaults llama-30b-pretrain pretrain --cache_dir $DATA_PATH --output_dir $MODEL_PATH/llama-30b-pre --deepspeed

After the model was loaded, the run hung for a long time (about 30 minutes, I think, since PyTorch's default timeout is 30 minutes).
Then this error was raised:

RuntimeError: Timed out initializing process group in store based barrier on rank 2 # same error on every rank

Any solutions?

@olliestanley olliestanley added the ml label Aug 6, 2023
@andreaskoepf
Collaborator

We have not seen this error during our training runs. Could you try smaller or different models first? Are you using the latest version of DeepSpeed? Which GPU and CUDA version are you using? Do you have access to a different machine on which you could cross-check?
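
To answer the version questions above in one go, a small script can print the installed package versions. This is only a sketch: the package list is an assumption about what this training setup depends on, and the helper name `collect_versions` is made up for illustration. It reads installed package metadata rather than importing the packages, so it also works when one of them is missing. DeepSpeed additionally ships a `ds_report` command that prints a more detailed environment report.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "deepspeed", "transformers")):
    """Return a dict mapping each package name to its installed version.

    The default package list is an assumption, not taken from the project.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            # Reads the installed distribution metadata without importing the package.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The output can be pasted directly into the issue alongside the GPU model and CUDA version.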

@SingL3
Contributor Author

SingL3 commented Aug 8, 2023

@andreaskoepf
Yes, I am using the latest versions as of last week, including deepspeed.
I am using 8x A100 (80 GB) with CUDA 11.7.
I tried reducing the pretrain datasets (keeping only alpaca_gpt4) and the run succeeded, so I don't think the model is the cause.
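
Since the run succeeds with a smaller dataset, one possible reading (not confirmed in this thread) is that some ranks spend longer than the timeout on data preparation while the others wait at the barrier. If that is the case, raising the process group timeout may help. Below is a minimal sketch using the `timeout` argument of `torch.distributed.init_process_group`. The environment variable `PG_TIMEOUT_MINUTES` and both function names are hypothetical, and where the process group is actually initialized in `trainer_sft.py` is not something this thread shows.

```python
import os
from datetime import timedelta


def pg_timeout(default_minutes: int = 30) -> timedelta:
    """Read the timeout from PG_TIMEOUT_MINUTES (a hypothetical variable).

    Falls back to 30 minutes, the default reported in this issue.
    """
    minutes = int(os.environ.get("PG_TIMEOUT_MINUTES", default_minutes))
    return timedelta(minutes=minutes)


def init_process_group_with_timeout() -> None:
    """Initialize the default process group with the configured timeout."""
    # Imported lazily so pg_timeout() can be used without torch installed.
    import torch.distributed as dist

    # Relies on RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT being set
    # by the launcher, as is usual for env:// initialization.
    dist.init_process_group(backend="nccl", timeout=pg_timeout())
```

For example, exporting `PG_TIMEOUT_MINUTES=120` before launching would give every rank two hours to reach the barrier. A longer timeout only hides slow startup; if a rank is genuinely stuck, the job will still fail, just later.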
