Sub-workers exit without messages #692
I tried to dig deeper and recover the error causing the exit. I ended up here:
This is the command I use to launch the training.
Two CUDA GPUs are available. The single-GPU version (just setting the CUDA_VISIBLE_DEVICES env variable) works well.
Confirming that in my case, too, this error comes from
@mahnerak I solved this by adding num_workers=0.
Thanks @GongZhengLi. I don't think
@mahnerak, did you solve it?
Not yet. Still waiting. |
🐛 Bug
I launch training with the following script:
```shell
CUDA_VISIBLE_DEVICES="0, 1, 2, 3" metaseq-train --task streaming_language_modeling \
    data/pile-test/ \
    --num-workers 4 \
    --reset-dataloader \
    --vocab-filename ./vocab/gpt2-vocab.json \
    --merges-filename ./vocab/gpt2-merges.txt \
    --model-parallel-size 1 \
    --ddp-backend fully_sharded \
    --task-ddp-backend fully_sharded \
    --criterion cross_entropy \
    --batch-size 8 \
    --save-dir /checkpoints/lm_transformer_pile-00 \
    --arch transformer_lm_gpt2_tiny --share-decoder-input-output-embed \
    --dropout 0.1 \
    --optimizer adam --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --tokens-per-sample 1024 --sample-break-mode none --fp16 \
    --use-sharded-state \
    --decoder-learned-pos \
    --log-format json \
    --log-interval 1
```
Ranks 1, 2, and 3 exit before the train_step loop. I printed detailed logs at every step and found that the iter() call inside more_itertools.peekable() kills all the non-master processes.
What is causing this?
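The eager iter() behavior described above can be illustrated with a stdlib-only sketch. Note that `Peekable` and `TracingIterable` below are hypothetical stand-ins written for this illustration, not metaseq or more_itertools code; the one property borrowed from the real more_itertools.peekable is that its constructor calls iter() on the wrapped iterable immediately, which for a PyTorch DataLoader with num_workers > 0 is the point where worker processes are spawned:

```python
# Sketch: wrapping an iterable in a peekable calls iter() eagerly, before
# any item is requested. TracingIterable stands in for a DataLoader and
# records when its __iter__ runs.

class TracingIterable:
    """Stand-in for a DataLoader: records when __iter__ is invoked."""
    def __init__(self, items):
        self.items = items
        self.iter_called = False

    def __iter__(self):
        self.iter_called = True  # a real DataLoader would spawn workers here
        return iter(self.items)

class Peekable:
    """Minimal peekable: eager iter() in __init__, one-item lookahead."""
    _MISSING = object()

    def __init__(self, iterable):
        self._it = iter(iterable)   # eager: runs before any item is requested
        self._cache = self._MISSING

    def peek(self):
        if self._cache is self._MISSING:
            self._cache = next(self._it)
        return self._cache

    def __next__(self):
        if self._cache is not self._MISSING:
            value, self._cache = self._cache, self._MISSING
            return value
        return next(self._it)

    def __iter__(self):
        return self

source = TracingIterable([1, 2, 3])
p = Peekable(source)
print(source.iter_called)  # True: iter() already ran, with no item consumed
```

So merely constructing the peekable, e.g. when the trainer sets up its epoch iterator, is enough to trigger worker startup, which is consistent with the num_workers=0 workaround mentioned in the comments above making the crash disappear.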