Exception: cuda rng state model-parallel-rng is not added #369

Open
520jefferson opened this issue Mar 6, 2023 · 1 comment

@520jefferson

I started the job and then hit this error:
cuda: 12.0
torch: 1.14

deepspeed --num_gpus 2 pretrain_gpt_v2.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --distributed-backend nccl --num-layers 2 --hidden-size 64 --num-attention-heads 2 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 1 --rampup-batch-size 2 2 1_000 --global-batch-size 16 --train-samples 100 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --lr 1e-4 --lr-warmup-samples 5 --clip-grad 1.0 --weight-decay 1e-1 --vocab-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json --merge-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt --fp16 --log-interval 10 --save-interval 100 --eval-interval 100 --eval-iters 10 --checkpoint-activations --save alibi_test --load alibi_test --data-path /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document --tensorboard-dir output_dir_tensorboard --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --deepspeed --deepspeed_config ./ds_config.json --zero-stage 1 --deepspeed-activation-checkpointing
[2023-03-06 04:10:57,331] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-06 04:10:57,454] [INFO] [runner.py:548:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain_gpt_v2.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --distributed-backend nccl --num-layers 2 --hidden-size 64 --num-attention-heads 2 --seq-length 1024 --max-position-embeddings 1024 --micro-batch-size 1 --rampup-batch-size 2 2 1_000 --global-batch-size 16 --train-samples 100 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --lr 1e-4 --lr-warmup-samples 5 --clip-grad 1.0 --weight-decay 1e-1 --vocab-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json --merge-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt --fp16 --log-interval 10 --save-interval 100 --eval-interval 100 --eval-iters 10 --checkpoint-activations --save alibi_test --load alibi_test --data-path /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document --tensorboard-dir output_dir_tensorboard --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --deepspeed --deepspeed_config ./ds_config.json --zero-stage 1 --deepspeed-activation-checkpointing
[2023-03-06 04:10:59,741] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.16.5
[2023-03-06 04:10:59,741] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-03-06 04:10:59,741] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-06 04:10:59,741] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-03-06 04:10:59,741] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-06 04:10:59,741] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
using world size: 2, data-parallel-size: 2, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
abort_on_unmet_fused_kernel_constraints ......... False
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_in_cpu ............................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
codecarbon_dir .................................. None
consumed_train_samples .......................... 0
consumed_train_tokens ........................... 0
consumed_valid_samples .......................... 0
contigious_checkpointing ........................ False
cpu_optimizer ................................... False
cpu_torch_adam .................................. False
curriculum_learning ............................. False
data_impl ....................................... infer
data_parallel_size .............................. 2
data_path ....................................... ['/mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
deepscale ....................................... False
deepscale_config ................................ None
deepspeed ....................................... True
deepspeed_activation_checkpointing .............. True
deepspeed_config ................................ ./ds_config.json
deepspeed_mpi ................................... False
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embed_layernorm ................................. False
embedding_path .................................. None
encoder_seq_length .............................. 1024
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 10
eval_only ....................................... None
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 256
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
gigaflos_no_embeds .............................. 0
global_batch_size ............................... 16
glu_activation .................................. None
hidden_dropout .................................. 0.1
hidden_size ..................................... 64
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference ....................................... False
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kill_switch_path ................................ None
kv_channels ..................................... 32
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ alibi_test
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... True
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_level ....................................... None
log_level_replica ............................... None
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_path ........................................ None
log_timers_to_tensorboard ....................... True
log_validation_ppl_to_tensorboard ............... True
loss_on_targets_only ............................ False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_decay_tokens ................................. None
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 5
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
mean_noise_span_length .......................... None
memory_centric_tiled_linear ..................... False
merge_file ...................................... /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
noise_density ................................... None
num_attention_heads ............................. 2
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 2
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
pad_vocab_size_to ............................... None
params_dtype .................................... torch.float16
partition_activations ........................... False
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... PositionEmbeddingType.absolute
pp_partition_method ............................. None
profile_backward ................................ False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... ['2', '2', '1_000']
rank ............................................ 0
remote_device ................................... none
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
reweight_loss_based_on_position_frequency ....... False
sample_rate ..................................... 1.0
save ............................................ alibi_test
save_interval ................................... 100
scatter_gather_tensors_in_pipeline .............. True
scattered_embeddings ............................ False
seed ............................................ 1234
seq_length ...................................... 1024
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train_iteration_range ...................... None
split ........................................... 969, 30, 1
split_transformers .............................. False
sync_tp_duplicated_parameters ................... False
synchronize_each_layer .......................... False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. output_dir_tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 5
test_weighted_split_names ....................... None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
test_weighted_split_splits ...................... None
test_weighted_split_weights ..................... None
tile_factor ..................................... 1
titles_data_path ................................ None
tokenizer_name_or_path .......................... None
tokenizer_type .................................. GPT2BPETokenizer
train_iters ..................................... None
train_samples ................................... 100
train_tokens .................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
universal_checkpoint ............................ False
use_bnb_optimizer ............................... False
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
use_pin_memory .................................. False
valid_num_workers ............................... 2
valid_weighted_split_names ...................... None
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
valid_weighted_split_splits ..................... None
valid_weighted_split_weights .................... None
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json
weight_decay .................................... 0.1
world_size ...................................... 2
zero_allgather_bucket_size ...................... 0.0
zero_contigious_gradients ....................... False
zero_reduce_bucket_size ......................... 0.0
zero_reduce_scatter ............................. False
zero_stage ...................................... 1
-------------------- end of arguments ---------------------
will use batch size rampup starting from global batch size 2 to global batch size 16 with batch size increments 2 over 1000 samples.
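
(For reference, the schedule that log line describes: the global batch size starts at 2 and grows by 2 across the 1,000 ramp-up samples until it reaches 16. A minimal sketch of that logic as I read it, not code lifted from this repo:

    def global_batch_size(consumed_samples,
                          start=2, increment=2, rampup_samples=1_000, final=16):
        # Past the ramp-up window, the batch size stays at its final value.
        if consumed_samples >= rampup_samples:
            return final
        num_increments = (final - start) // increment          # 7 steps of +2
        samples_per_increment = rampup_samples / num_increments
        steps_done = int(consumed_samples // samples_per_increment)
        return min(final, start + increment * steps_done)

    assert global_batch_size(0) == 2
    assert global_batch_size(1_000) == 16
)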

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
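
(The 47 dummy tokens follow from Megatron padding the vocabulary up to a multiple of make_vocab_size_divisible_by (128) times the tensor-parallel size (1), so the embedding splits evenly across ranks. A quick sketch of that arithmetic, from memory:

    def padded_vocab_size(orig_size, make_divisible_by=128, tp_size=1):
        # Round up to the nearest multiple of make_divisible_by * tp_size.
        multiple = make_divisible_by * tp_size
        return ((orig_size + multiple - 1) // multiple) * multiple

    assert padded_vocab_size(50257) == 50304   # 50304 - 50257 = 47 dummy tokens
)
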
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0
setting tensorboard ...
**** Git info for Megatron: git_hash=e52bdab git_branch=main ****
initializing torch distributed ...
[2023-03-06 04:11:06,233] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/data'

done with dataset index builder. Compilation time: 0.107 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
done with compiling and loading fused kernels. Compilation time: 2.865 seconds
time to initialize megatron (seconds): -33.941
[after megatron is initialized] datetime: 2023-03-06 04:11:10
building GPT model ...
args.deepspeed: True
args.deepspeed_config: ./ds_config.json
args.deepspeed:
goes deepspeed ................
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
[2023-03-06 04:11:10,122] [INFO] [module.py:370:_partition_layers] Partitioning pipeline stages with method type:transformer
stage=0 layers=9
0: _to_float16
1: EmbeddingPipe
2: &lt;lambda&gt;
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: undo
6: MixedFusedLayerNorm
7: EmbeddingPipe
8: float16_to_fp32
loss: CrossEntropy
Both ranks raise the same traceback:

Traceback (most recent call last):
  File "pretrain_gpt_v2.py", line 243, in &lt;module&gt;
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "pretrain_gpt_v2.py", line 238, in main
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 401, in setup_model_and_optimizer
    model = get_model(model_provider_func)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 269, in get_model
    model = model_provider_func(
  File "pretrain_gpt_v2.py", line 63, in model_provider
    model = GPTModelPipe(
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/model/gpt_model.py", line 315, in __init__
    super().__init__(layers=self.specs,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 203, in __init__
    self._build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 238, in _build
    self.tied_modules[layer.key] = layer.build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 69, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/model/language_model.py", line 131, in __init__
    self.word_embeddings = mpu.VocabParallelEmbedding(
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/mpu/layers.py", line 213, in __init__
    _initialize_affine_weight_gpu(self.weight, init_method,
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/mpu/layers.py", line 95, in _initialize_affine_weight_gpu
    with get_cuda_rng_tracker().fork():
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added
[2023-03-06 04:11:11,771] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 866
[2023-03-06 04:11:11,771] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 867
[2023-03-06 04:11:11,772] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'pretrain_gpt_v2.py', '--local_rank=1', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--distributed-backend', 'nccl', '--num-layers', '2', '--hidden-size', '64', '--num-attention-heads', '2', '--seq-length', '1024', '--max-position-embeddings', '1024', '--micro-batch-size', '1', '--rampup-batch-size', '2', '2', '1_000', '--global-batch-size', '16', '--train-samples', '100', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-8', '--lr', '1e-4', '--lr-warmup-samples', '5', '--clip-grad', '1.0', '--weight-decay', '1e-1', '--vocab-file', '/mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json', '--merge-file', '/mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt', '--fp16', '--log-interval', '10', '--save-interval', '100', '--eval-interval', '100', '--eval-iters', '10', '--checkpoint-activations', '--save', 'alibi_test', '--load', 'alibi_test', '--data-path', '/mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document', '--tensorboard-dir', 'output_dir_tensorboard', '--tensorboard-queue-size', '5', '--log-timers-to-tensorboard', '--log-batch-size-to-tensorboard', '--log-validation-ppl-to-tensorboard', '--deepspeed', '--deepspeed_config', './ds_config.json', '--zero-stage', '1', '--deepspeed-activation-checkpointing'] exits with return code = 1
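
For context on what is failing: fork() in DeepSpeed's activation-checkpointing module only yields RNG states that were previously registered with add(). Megatron registers 'model-parallel-rng' through model_parallel_cuda_manual_seed(), so the exception means the tracker instance that fork() runs on was never seeded, which typically happens when --deepspeed-activation-checkpointing swaps in DeepSpeed's tracker after the seeding already ran on Megatron's own tracker. A condensed sketch of the tracker logic (simplified from checkpointing.py, not verbatim):

    import torch

    _MODEL_PARALLEL_RNG = 'model-parallel-rng'

    class CudaRNGStatesTracker:
        def __init__(self):
            # Map of name -> saved CUDA RNG state; starts empty.
            self.states_ = {}

        def add(self, name, seed):
            # Seed CUDA, capture the resulting RNG state under `name`,
            # then restore the original state.
            orig_state = torch.cuda.get_rng_state()
            torch.cuda.manual_seed(seed)
            self.states_[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig_state)

        def fork(self, name=_MODEL_PARALLEL_RNG):
            # The check the traceback above dies on: nothing ever called
            # add('model-parallel-rng', seed) on *this* tracker instance.
            if name not in self.states_:
                raise Exception('cuda rng state {} is not added'.format(name))
            # ... (the real version is a context manager that swaps to the
            # tracked state, yields, and swaps back) ...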

@XaviLv

XaviLv commented Aug 19, 2023

See here
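
For readers who land here later: the usual resolution is to configure DeepSpeed's activation checkpointing, and re-route Megatron's RNG hooks through it, before any model weights are built, then set the seed through the swapped-in tracker. A sketch of that ordering, modeled on Megatron-DeepSpeed's setup_deepspeed_random_and_activation_checkpointing helper (argument names from memory; check them against your checkout):

    import deepspeed
    from megatron import mpu

    def setup_deepspeed_random_and_activation_checkpointing(args):
        deepspeed.checkpointing.configure(
            mpu,
            partition_activations=args.partition_activations,
            contiguous_checkpointing=args.contigious_checkpointing,
            checkpoint_in_cpu=args.checkpoint_in_cpu,
            synchronize=args.synchronize_each_layer,
            profile=args.profile_backward)

        # Route Megatron's checkpoint/RNG entry points through DeepSpeed so
        # there is exactly one live tracker.
        mpu.checkpoint = deepspeed.checkpointing.checkpoint
        mpu.get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
        mpu.model_parallel_cuda_manual_seed = \
            deepspeed.checkpointing.model_parallel_cuda_manual_seed

Crucially, this must run (and mpu.model_parallel_cuda_manual_seed(args.seed) must be called afterwards) before GPTModelPipe constructs its first VocabParallelEmbedding.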
