Padding during training results in a "Killed" #64

Open
MKSharaf opened this issue Nov 12, 2023 · 0 comments

@MKSharaf

I'm using Colab, and I was only using CommonsenseConversation as my dataset. Everything was going fine until it started padding; for some reason the padding stopped midway and the process ended in a "Killed" state. What could be the cause of this? Here is the output/logs.

/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING]
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-11-12 22:40:53,080] torch.distributed.run: [WARNING] *****************************************
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
OPENAI_LOGDIR=diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 TOKENIZERS_PARALLELISM=false python train.py --checkpoint_path diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53 --dataset qqp --data_dir datasets/CC --vocab bert --use_plm_init no --lr 0.0001 --batch_size 2048 --microbatch 64 --diffusion_steps 2000 --noise_schedule sqrt --schedule_sampler lossaware --resume_checkpoint none --seq_len 128 --hidden_t_dim 128 --seed 102 --hidden_dim 128 --learning_steps 50000 --save_interval 10000 --config_name bert-base-uncased --notes test-qqp20231112-22:40:53
2023-11-12 22:41:06.178406: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.178464: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.178509: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.199910: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.199966: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.200004: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.217535: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.234851: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.370428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.370482: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.370521: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.393033: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:06.547999: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 22:41:06.550180: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 22:41:06.550241: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 22:41:06.597710: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-12 22:41:10.572711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.665962: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.727451: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-12 22:41:10.986807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
File "/content/DiffuSeq/train.py", line 115, in
main()
File "/content/DiffuSeq/train.py", line 37, in main
dist_util.setup_dist()
File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
th.cuda.set_device(dev())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/content/DiffuSeq/train.py", line 115, in
main()
File "/content/DiffuSeq/train.py", line 37, in main
dist_util.setup_dist()
File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
th.cuda.set_device(dev())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/content/DiffuSeq/train.py", line 115, in
main()
File "/content/DiffuSeq/train.py", line 37, in main
dist_util.setup_dist()
File "/content/DiffuSeq/diffuseq/utils/dist_util.py", line 41, in setup_dist
th.cuda.set_device(dev())
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Logging to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20231112-22:40:53

Creating data loader...

(…)cased/resolve/main/tokenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 116kB/s]
(…)rt-base-uncased/resolve/main/config.json: 100% 570/570 [00:00<00:00, 2.54MB/s]
(…)bert-base-uncased/resolve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 5.43MB/s]
(…)base-uncased/resolve/main/tokenizer.json: 100% 466k/466k [00:00<00:00, 7.70MB/s]
initializing the random embeddings Embedding(30522, 128)
##############################
Loading text data...
##############################
Loading dataset qqp from datasets/CC...

Loading form the TRAIN set...

Data samples...

['jesus , what kind of concerts do you go to where people sucker punch you for being born tall ?', 'almost all of those sound awful . dr . ken sounds like it could be good , but that description is too vague to really tell anything . in chang we trust , or something .'] ['the kind that allow bitter short people in . so basically all of them .', "if he 's anything like his knocked up character i 'm sure it 'll be pretty funny ."]
RAM used: 1986.24 MB
Dataset({
features: ['src', 'trg'],
num_rows: 3382137
})
RAM used: 2643.07 MB
Running tokenizer on dataset (num_proc=4): 100% 3382137/3382137 [12:08<00:00, 4643.12 examples/s]

tokenized_datasets Dataset({
features: ['input_id_x', 'input_id_y'],
num_rows: 3382137
})

tokenized_datasets...example [101, 4441, 1010, 2054, 2785, 1997, 6759, 2079, 2017, 2175, 2000, 2073, 2111, 26476, 8595, 2017, 2005, 2108, 2141, 4206, 1029, 102]

RAM used: 4182.97 MB
merge and mask: 100% 3382137/3382137 [02:41<00:00, 20891.13 examples/s]
RAM used: 6818.71 MB
padding: 65% 2207000/3382137 [03:08<01:16, 15381.54 examples/s]Killed
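
For context on the log: the three `invalid device ordinal` tracebacks suggest the launcher spawned more worker processes than there are GPUs (Colab normally exposes one), and the final bare `Killed` with no Python traceback usually means the Linux out-of-memory killer ended the surviving process once host RAM ran out; the `RAM used` readings above already climb from roughly 2 GB to 6.8 GB before padding begins. Below is a minimal sketch, assuming `psutil` is available (it is on Colab), of how one might watch memory around the padding step; `pad_function` is a placeholder name, not DiffuSeq's actual code.

```python
import os
import psutil  # available on Colab by default

def log_ram(tag: str) -> None:
    # Resident-set size of this process, comparable to the "RAM used" lines above.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    # Host memory still available; if this approaches zero, the OOM killer fires.
    avail_mb = psutil.virtual_memory().available / 1024 ** 2
    print(f"{tag}: RSS {rss_mb:.2f} MB, available {avail_mb:.2f} MB")

log_ram("before padding")
# padded = tokenized_datasets.map(pad_function, batched=True, num_proc=4)  # placeholder for the padding step
log_ram("after padding")
```

If the "available" number collapses during that map call, the kill is coming from memory pressure rather than from the padding code itself; lowering `num_proc` or preprocessing on a machine with more RAM would be the obvious things to try, but I'm only guessing here.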
