
Cache problem while running on multiple nodes with GPU #30859

Open
2 of 4 tasks
yuane4 opened this issue May 16, 2024 · 3 comments

Comments

@yuane4

yuane4 commented May 16, 2024

System Info

Hi,

I am currently trying to use the script run_mlm_wwm.py to perform continual pretraining with the whole-word-masking objective on a BERT model. My problem occurs when I try to use multiple GPUs: as soon as the number of GPUs forces my server to spread the job over several nodes, I get an error.
I think there is a locking problem linked to the parallel file system when I go over several nodes. I would probably need to set a different cache for each process (for example by appending the global rank) or for each process on the same node, but I have not managed to do that so far.
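To make the idea concrete, here is a minimal, untested sketch of the per-node variant, assuming Slurm's SLURM_NODEID environment variable is available to each task (the paths are just the ones from my setup):

```python
import os

from datasets import load_dataset

# Sketch only: give each node its own datasets cache directory so that builders
# running on different nodes never write into the same ".incomplete" folder on
# the shared parallel file system. SLURM_NODEID is set by Slurm for every task.
base_cache = "/gpfswork/rech/khy/uvb95lb/scrip_continual_pretraining/Cache_mlm"
node_cache = os.path.join(base_cache, f"node_{os.environ.get('SLURM_NODEID', '0')}")
os.makedirs(node_cache, exist_ok=True)

# TRAIN_FILE and VALIDATION_FILE are exported in my submission script.
datasets = load_dataset(
    "text",
    data_files={"train": os.environ["TRAIN_FILE"], "validation": os.environ["VALIDATION_FILE"]},
    cache_dir=node_cache,
)
```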

Here is the error message I get:

Loading pytorch-gpu/py3/2.1.1
Loading requirement: cuda/11.8.0 nccl/2.18.5-1-cuda cudnn/8.7.0.84-cuda
gcc/8.5.0 openmpi/4.1.5-cuda intel-mkl/2020.4 magma/2.7.1-cuda sox/14.4.2
sparsehash/2.0.3 libjpeg-turbo/2.1.3 ffmpeg/4.4.4

  • export OMP_NUM_THREADS=10
  • OMP_NUM_THREADS=10
  • export CUDA_LAUNCH_BLOCKING=1
  • CUDA_LAUNCH_BLOCKING=1
  • export NCCL_ASYNC_ERROR_HANDLING=1
  • NCCL_ASYNC_ERROR_HANDLING=1
  • export TRAIN_FILE=/textes_shuffle_txt/train_data_shuffle_2020a2022.txt
  • TRAIN_FILE=/textes_shuffle_txt/train_data_shuffle_2020a2022.txt
  • export VALIDATION_FILE=/textes_shuffle_txt/test_data_shuffle_2020a2022.txt
  • VALIDATION_FILE=/textes_shuffle_txt/test_data_shuffle_2020a2022.txt
  • export OUTPUT_DIR=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm
  • OUTPUT_DIR=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm
  • srun -l python -u run_mlm_wwm.py --model_name_or_path my_path_to_files/camembert-base --train_file my_path_to_files/textes_shuffle_txt/train_data_shuffle_2020a2022.txt --validation_file my_path_to_files/textes_shuffle_txt/test_data_shuffle_2020a2022.txt --per_device_train_batch_size=96 --do_train --warmup_steps=10000 --overwrite_output_dir --max_seq_length=512 --logging_steps=500 --report_to=tensorboard --save_strategy=epoch --skip_memory_metrics=False --log_level=info --logging_first_step=True --learning_rate 1e-4 --num_train_epochs 6.0 --fp16 --output_dir /gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm --do_eval --pad_to_max_length True --preprocessing_num_workers 8 --ddp_timeout=600 --ddp_find_unused_parameters=False
    srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 18 with the number of requested nodes 3. Ignoring --ntasks-per-node.
    0: comet_ml is installed but COMET_API_KEY is not set.
    6: comet_ml is installed but COMET_API_KEY is not set.
    7: comet_ml is installed but COMET_API_KEY is not set.
    8: comet_ml is installed but COMET_API_KEY is not set.
    9: comet_ml is installed but COMET_API_KEY is not set.
    10: comet_ml is installed but COMET_API_KEY is not set.
    11: comet_ml is installed but COMET_API_KEY is not set.
    12: comet_ml is installed but COMET_API_KEY is not set.
    13: comet_ml is installed but COMET_API_KEY is not set.
    14: comet_ml is installed but COMET_API_KEY is not set.
    15: comet_ml is installed but COMET_API_KEY is not set.
    16: comet_ml is installed but COMET_API_KEY is not set.
    17: comet_ml is installed but COMET_API_KEY is not set.
    3: comet_ml is installed but COMET_API_KEY is not set.
    4: comet_ml is installed but COMET_API_KEY is not set.
    5: comet_ml is installed but COMET_API_KEY is not set.
    1: comet_ml is installed but COMET_API_KEY is not set.
    2: comet_ml is installed but COMET_API_KEY is not set.
    2: 05/04/2024 06:14:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
    1: 05/04/2024 06:14:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
    3: 05/04/2024 06:14:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
    4: 05/04/2024 06:14:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
    5: 05/04/2024 06:14:12 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
    14: 05/04/2024 06:14:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
    15: 05/04/2024 06:14:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
    16: 05/04/2024 06:14:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
    12: 05/04/2024 06:14:12 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
    13: 05/04/2024 06:14:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
    17: 05/04/2024 06:14:12 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
    12: 05/04/2024 06:14:12 - INFO - main - Training/evaluation parameters TrainingArguments(
    12: _n_gpu=1,
    12: adafactor=False,
    12: adam_beta1=0.9,
    12: adam_beta2=0.999,
    12: adam_epsilon=1e-08,
    12: auto_find_batch_size=False,
    12: bf16=False,
    12: bf16_full_eval=False,
    12: data_seed=None,
    12: dataloader_drop_last=False,
    12: dataloader_num_workers=0,
    12: dataloader_pin_memory=True,
    12: ddp_backend=None,
    12: ddp_broadcast_buffers=None,
    12: ddp_bucket_cap_mb=None,
    12: ddp_find_unused_parameters=False,
    12: ddp_timeout=600,
    12: debug=[],
    12: deepspeed=None,
    12: disable_tqdm=False,
    12: dispatch_batches=None,
    12: do_eval=True,
    12: do_predict=False,
    12: do_train=True,
    12: eval_accumulation_steps=None,
    12: eval_delay=0,
    12: eval_steps=None,
    12: evaluation_strategy=IntervalStrategy.NO,
    12: fp16=True,
    12: fp16_backend=auto,
    12: fp16_full_eval=False,
    12: fp16_opt_level=O1,
    12: fsdp=[],
    12: fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
    12: fsdp_min_num_params=0,
    12: fsdp_transformer_layer_cls_to_wrap=None,
    12: full_determinism=False,
    12: gradient_accumulation_steps=1,
    12: gradient_checkpointing=False,
    12: gradient_checkpointing_kwargs=None,
    12: greater_is_better=None,
    12: group_by_length=False,
    12: half_precision_backend=auto,
    12: hub_always_push=False,
    12: hub_model_id=None,
    12: hub_private_repo=False,
    12: hub_strategy=HubStrategy.EVERY_SAVE,
    12: hub_token=<HUB_TOKEN>,
    12: ignore_data_skip=False,
    12: include_inputs_for_metrics=False,
    12: include_tokens_per_second=False,
    12: jit_mode_eval=False,
    12: label_names=None,
    12: label_smoothing_factor=0.0,
    12: learning_rate=0.0001,
    12: length_column_name=length,
    12: load_best_model_at_end=False,
    12: local_rank=0,
    12: log_level=info,
    12: log_level_replica=warning,
    12: log_on_each_node=True,
    12: logging_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm/runs/May04_06-14-12_jean-zay-iam07,
    12: logging_first_step=True,
    12: logging_nan_inf_filter=True,
    12: logging_steps=500,
    12: logging_strategy=IntervalStrategy.STEPS,
    12: lr_scheduler_type=SchedulerType.LINEAR,
    12: max_grad_norm=1.0,
    12: max_steps=-1,
    12: metric_for_best_model=None,
    12: mp_parameters=,
    12: neftune_noise_alpha=None,
    12: no_cuda=False,
    12: num_train_epochs=6.0,
    12: optim=OptimizerNames.ADAMW_TORCH,
    12: optim_args=None,
    12: output_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    12: overwrite_output_dir=True,
    12: past_index=-1,
    12: per_device_eval_batch_size=8,
    12: per_device_train_batch_size=96,
    12: prediction_loss_only=False,
    12: push_to_hub=False,
    12: push_to_hub_model_id=None,
    12: push_to_hub_organization=None,
    12: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
    12: ray_scope=last,
    12: remove_unused_columns=True,
    12: report_to=['tensorboard'],
    12: resume_from_checkpoint=None,
    12: run_name=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    12: save_on_each_node=False,
    12: save_safetensors=True,
    12: save_steps=500,
    12: save_strategy=IntervalStrategy.EPOCH,
    12: save_total_limit=None,
    12: seed=42,
    12: skip_memory_metrics=False,
    12: split_batches=False,
    12: tf32=None,
    12: torch_compile=False,
    12: torch_compile_backend=None,
    12: torch_compile_mode=None,
    12: torchdynamo=None,
    12: tpu_metrics_debug=False,
    12: tpu_num_cores=None,
    12: use_cpu=False,
    12: use_ipex=False,
    12: use_legacy_prediction_loop=False,
    12: use_mps_device=False,
    12: warmup_ratio=0.0,
    12: warmup_steps=10000,
    12: weight_decay=0.0,
    12: )
    Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 21454.24it/s]
    Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 11259.88it/s]
    0: 05/04/2024 06:14:13 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
    0: 05/04/2024 06:14:13 - INFO - main - Training/evaluation parameters TrainingArguments(
    0: _n_gpu=1,
    0: adafactor=False,
    0: adam_beta1=0.9,
    0: adam_beta2=0.999,
    0: adam_epsilon=1e-08,
    0: auto_find_batch_size=False,
    0: bf16=False,
    0: bf16_full_eval=False,
    0: data_seed=None,
    0: dataloader_drop_last=False,
    0: dataloader_num_workers=0,
    0: dataloader_pin_memory=True,
    0: ddp_backend=None,
    0: ddp_broadcast_buffers=None,
    0: ddp_bucket_cap_mb=None,
    0: ddp_find_unused_parameters=False,
    0: ddp_timeout=600,
    0: debug=[],
    0: deepspeed=None,
    0: disable_tqdm=False,
    0: dispatch_batches=None,
    0: do_eval=True,
    0: do_predict=False,
    0: do_train=True,
    0: eval_accumulation_steps=None,
    0: eval_delay=0,
    0: eval_steps=None,
    0: evaluation_strategy=IntervalStrategy.NO,
    0: fp16=True,
    0: fp16_backend=auto,
    0: fp16_full_eval=False,
    0: fp16_opt_level=O1,
    0: fsdp=[],
    0: fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
    0: fsdp_min_num_params=0,
    0: fsdp_transformer_layer_cls_to_wrap=None,
    0: full_determinism=False,
    0: gradient_accumulation_steps=1,
    0: gradient_checkpointing=False,
    0: gradient_checkpointing_kwargs=None,
    0: greater_is_better=None,
    0: group_by_length=False,
    0: half_precision_backend=auto,
    0: hub_always_push=False,
    0: hub_model_id=None,
    0: hub_private_repo=False,
    0: hub_strategy=HubStrategy.EVERY_SAVE,
    0: hub_token=<HUB_TOKEN>,
    0: ignore_data_skip=False,
    0: include_inputs_for_metrics=False,
    0: include_tokens_per_second=False,
    0: jit_mode_eval=False,
    0: label_names=None,
    0: label_smoothing_factor=0.0,
    0: learning_rate=0.0001,
    0: length_column_name=length,
    0: load_best_model_at_end=False,
    0: local_rank=0,
    0: log_level=info,
    0: log_level_replica=warning,
    0: log_on_each_node=True,
    0: logging_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm/runs/May04_06-14-13_jean-zay-iam05,
    0: logging_first_step=True,
    0: logging_nan_inf_filter=True,
    0: logging_steps=500,
    0: logging_strategy=IntervalStrategy.STEPS,
    0: lr_scheduler_type=SchedulerType.LINEAR,
    0: max_grad_norm=1.0,
    0: max_steps=-1,
    0: metric_for_best_model=None,
    0: mp_parameters=,
    0: neftune_noise_alpha=None,
    0: no_cuda=False,
    0: num_train_epochs=6.0,
    0: optim=OptimizerNames.ADAMW_TORCH,
    0: optim_args=None,
    0: output_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    0: overwrite_output_dir=True,
    0: past_index=-1,
    0: per_device_eval_batch_size=8,
    0: per_device_train_batch_size=96,
    0: prediction_loss_only=False,
    0: push_to_hub=False,
    0: push_to_hub_model_id=None,
    0: push_to_hub_organization=None,
    0: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
    0: ray_scope=last,
    0: remove_unused_columns=True,
    0: report_to=['tensorboard'],
    0: resume_from_checkpoint=None,
    0: run_name=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    0: save_on_each_node=False,
    0: save_safetensors=True,
    0: save_steps=500,
    0: save_strategy=IntervalStrategy.EPOCH,
    0: save_total_limit=None,
    0: seed=42,
    0: skip_memory_metrics=False,
    0: split_batches=False,
    0: tf32=None,
    0: torch_compile=False,
    0: torch_compile_backend=None,
    0: torch_compile_mode=None,
    0: torchdynamo=None,
    0: tpu_metrics_debug=False,
    0: tpu_num_cores=None,
    0: use_cpu=False,
    0: use_ipex=False,
    0: use_legacy_prediction_loop=False,
    0: use_mps_device=False,
    0: warmup_ratio=0.0,
    0: warmup_steps=10000,
    0: weight_decay=0.0,
    0: )
    7: 05/04/2024 06:14:13 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
    10: 05/04/2024 06:14:13 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
    9: 05/04/2024 06:14:13 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
    11: 05/04/2024 06:14:13 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
    6: 05/04/2024 06:14:13 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
    8: 05/04/2024 06:14:13 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
    6: 05/04/2024 06:14:13 - INFO - main - Training/evaluation parameters TrainingArguments(
    6: _n_gpu=1,
    6: adafactor=False,
    6: adam_beta1=0.9,
    6: adam_beta2=0.999,
    6: adam_epsilon=1e-08,
    6: auto_find_batch_size=False,
    6: bf16=False,
    6: bf16_full_eval=False,
    6: data_seed=None,
    6: dataloader_drop_last=False,
    6: dataloader_num_workers=0,
    6: dataloader_pin_memory=True,
    6: ddp_backend=None,
    6: ddp_broadcast_buffers=None,
    6: ddp_bucket_cap_mb=None,
    6: ddp_find_unused_parameters=False,
    6: ddp_timeout=600,
    6: debug=[],
    6: deepspeed=None,
    6: disable_tqdm=False,
    6: dispatch_batches=None,
    6: do_eval=True,
    6: do_predict=False,
    6: do_train=True,
    6: eval_accumulation_steps=None,
    6: eval_delay=0,
    6: eval_steps=None,
    6: evaluation_strategy=IntervalStrategy.NO,
    6: fp16=True,
    6: fp16_backend=auto,
    6: fp16_full_eval=False,
    6: fp16_opt_level=O1,
    6: fsdp=[],
    6: fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
    6: fsdp_min_num_params=0,
    6: fsdp_transformer_layer_cls_to_wrap=None,
    6: full_determinism=False,
    6: gradient_accumulation_steps=1,
    6: gradient_checkpointing=False,
    6: gradient_checkpointing_kwargs=None,
    6: greater_is_better=None,
    6: group_by_length=False,
    6: half_precision_backend=auto,
    6: hub_always_push=False,
    6: hub_model_id=None,
    6: hub_private_repo=False,
    6: hub_strategy=HubStrategy.EVERY_SAVE,
    6: hub_token=<HUB_TOKEN>,
    6: ignore_data_skip=False,
    6: include_inputs_for_metrics=False,
    6: include_tokens_per_second=False,
    6: jit_mode_eval=False,
    6: label_names=None,
    6: label_smoothing_factor=0.0,
    6: learning_rate=0.0001,
    6: length_column_name=length,
    6: load_best_model_at_end=False,
    6: local_rank=0,
    6: log_level=info,
    6: log_level_replica=warning,
    6: log_on_each_node=True,
    6: logging_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm/runs/May04_06-14-13_jean-zay-iam06,
    6: logging_first_step=True,
    6: logging_nan_inf_filter=True,
    6: logging_steps=500,
    6: logging_strategy=IntervalStrategy.STEPS,
    6: lr_scheduler_type=SchedulerType.LINEAR,
    6: max_grad_norm=1.0,
    6: max_steps=-1,
    6: metric_for_best_model=None,
    6: mp_parameters=,
    6: neftune_noise_alpha=None,
    6: no_cuda=False,
    6: num_train_epochs=6.0,
    6: optim=OptimizerNames.ADAMW_TORCH,
    6: optim_args=None,
    6: output_dir=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    6: overwrite_output_dir=True,
    6: past_index=-1,
    6: per_device_eval_batch_size=8,
    6: per_device_train_batch_size=96,
    6: prediction_loss_only=False,
    6: push_to_hub=False,
    6: push_to_hub_model_id=None,
    6: push_to_hub_organization=None,
    6: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
    6: ray_scope=last,
    6: remove_unused_columns=True,
    6: report_to=['tensorboard'],
    6: resume_from_checkpoint=None,
    6: run_name=/gpfsscratch/rech/khy/uvb95lb/test-mlm-wwm,
    6: save_on_each_node=False,
    6: save_safetensors=True,
    6: save_steps=500,
    6: save_strategy=IntervalStrategy.EPOCH,
    6: save_total_limit=None,
    6: seed=42,
    6: skip_memory_metrics=False,
    6: split_batches=False,
    6: tf32=None,
    6: torch_compile=False,
    6: torch_compile_backend=None,
    6: torch_compile_mode=None,
    6: torchdynamo=None,
    6: tpu_metrics_debug=False,
    6: tpu_num_cores=None,
    6: use_cpu=False,
    6: use_ipex=False,
    6: use_legacy_prediction_loop=False,
    6: use_mps_device=False,
    6: warmup_ratio=0.0,
    6: warmup_steps=10000,
    6: weight_decay=0.0,
    6: )
    Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 21183.35it/s]
    Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 7.98it/s]
    Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 7.37it/s]
    Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 26.94it/s]
    Generating train split: 735691 examples [00:02, 368808.23 examples/s]
    Generating train split: 674152 examples [00:02, 277129.32 examples/s]
    Generating train split: 674152 examples [00:03, 247846.23 examples/s]
    Generating train split: 1593705 examples [00:04, 420185.94 examples/s]
    Generating train split: 1501884 examples [00:05, 283821.86 examples/s]
    Generating train split: 1410002 examples [00:05, 375326.74 examples/s]
    Generating train split: 2513358 examples [00:06, 437244.14 examples/s]
    Generating train split: 2421559 examples [00:07, 378112.29 examples/s]
    Generating train split: 2329470 examples [00:08, 326257.49 examples/s]
    Generating train split: 3370011 examples [00:08, 448849.21 examples/s]
    Generating train split: 3278022 examples [00:10, 358206.71 examples/s]
    Generating train split: 3155413 examples [00:10, 341364.95 examples/s]
    Generating train split: 4290261 examples [00:11, 447049.42 examples/s]
    Generating train split: 4198175 examples [00:12, 373207.24 examples/s]
    Generating train split: 4074904 examples [00:12, 353087.99 examples/s]
    Generating train split: 5147893 examples [00:13, 417013.34 examples/s]
    Generating train split: 5056374 examples [00:14, 376059.59 examples/s]
    Generating train split: 6067697 examples [00:15, 448228.75 examples/s]s/s]
    Generating train split: 4933690 examples [00:15, 421004.81 examples/s]
    Generating train split: 6926173 examples [00:17, 450815.07 examples/s]
    Generating train split: 5975531 examples [00:17, 383399.07 examples/s]/s]
    Generating train split: 5852750 examples [00:17, 365422.88 examples/s]s]
    Generating train split: 7783637 examples [00:19, 467585.48 examples/s]
    Generating train split: 6712504 examples [00:19, 422087.08 examples/s]
    Generating train split: 6835032 examples [00:19, 295981.16 examples/s]
    Generating train split: 8703429 examples [00:21, 438801.32 examples/s]
    Generating train split: 7569889 examples [00:22, 405311.56 examples/s]
    Generating train split: 7691972 examples [00:22, 384627.19 examples/s]
    Generating train split: 9561456 examples [00:23, 430769.79 examples/s]
    Generating train split: 8611880 examples [00:24, 368744.37 examples/s]
    Generating train split: 8488889 examples [00:24, 309105.36 examples/s]
    Generating train split: 10481374 examples [00:25, 434503.57 examples/s]
    Generating train split: 9469794 examples [00:26, 366982.35 examples/s]
    Generating train split: 9347416 examples [00:26, 439195.31 examples/s]
    Generating train split: 11340042 examples [00:27, 446102.44 examples/s]
    Generating train split: 12198580 examples [00:29, 434616.45 examples/s]
    Generating train split: 10389743 examples [00:29, 398051.80 examples/s]
    Generating train split: 10267411 examples [00:29, 368192.47 examples/s]
    Generating train split: 13056612 examples [00:31, 439929.51 examples/s]
    Generating train split: 11247710 examples [00:31, 398261.78 examples/s]
    Generating train split: 11125153 examples [00:31, 419988.52 examples/s]
    Generating train split: 13913907 examples [00:33, 458949.37 examples/s]
    Generating train split: 12106665 examples [00:33, 380881.35 examples/s]
    Generating train split: 11983873 examples [00:33, 358349.17 examples/s]
    Generating train split: 14833307 examples [00:35, 423504.66 examples/s]
    Generating train split: 12964439 examples [00:35, 400086.35 examples/s]
    Generating train split: 12841924 examples [00:36, 400414.04 examples/s]
    Generating train split: 15690649 examples [00:37, 438872.86 examples/s]
    Generating train split: 13822319 examples [00:37, 398832.34 examples/s]
    Generating train split: 13769754 examples [00:38, 425513.63 examples/s]
    Generating train split: 16548238 examples [00:39, 450448.68 examples/s]
    Generating train split: 14741800 examples [00:40, 390752.76 examples/s]
    Generating train split: 14619130 examples [00:40, 395537.82 examples/s]
    Generating train split: 17405436 examples [00:41, 442940.32 examples/s]
    Generating train split: 15598726 examples [00:42, 356492.27 examples/s]
    Generating train split: 15476523 examples [00:42, 369237.01 examples/s]
    Generating train split: 18323721 examples [00:43, 459883.94 examples/s]
    Generating train split: 16456225 examples [00:44, 403418.79 examples/s]
    Generating train split: 19181845 examples [00:45, 444280.63 examples/s]
    Generating train split: 16334014 examples [00:44, 431093.34 examples/s]
    Generating train split: 20038448 examples [00:47, 438563.69 examples/s]
    Generating train split: 17312794 examples [00:47, 402221.80 examples/s]/s]
    Generating train split: 20130863 examples [00:47, 425298.18 examples/s]
    Generating train split: 17190731 examples [00:47, 352356.97 examples/s]
    Generating train split: 18231895 examples [00:49, 438039.86 examples/s]
    Generating validation split: 795560 examples [00:01, 450074.53 examples/s]
    Generating train split: 18109337 examples [00:49, 387274.11 examples/s]
    Generating train split: 19090119 examples [00:51, 431144.93 examples/s]
    Generating validation split: 1592633 examples [00:03, 447538.35 examples/s]
    Generating train split: 18967698 examples [00:51, 400959.93 examples/s]
    Generating train split: 19946306 examples [00:53, 444335.85 examples/s]
    Generating validation split: 2450319 examples [00:05, 437344.97 examples/s]
    Generating train split: 20130863 examples [00:53, 376271.19 examples/s]
    14: Traceback (most recent call last):
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 902, in incomplete_dir
    Generating train split: 19823979 examples [00:53, 427450.54 examples/s]
    14: yield tmp_dir
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 948, in download_and_prepare
    14: self._download_and_prepare(
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 1045, in _download_and_prepare
    14: raise OSError(
    14: OSError: Cannot find data file.
    14: Original error:
    14: [Errno 2] No such file or directory: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete/text-train-00000-00000-of-NNNNN.arrow'
    14:
    14: During handling of the above exception, another exception occurred:
    14:
    14: Traceback (most recent call last):
    14: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 450, in
    14: main()
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
    14: return f(*args, **kwargs)
    14: ^^^^^^^^^^^^^^^^^^
    14: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 294, in main
    14: datasets = load_dataset(extension, data_files=data_files, cache_dir="/scrip_continual_pretraining/Cache_mlm")
    14: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/load.py", line 2152, in load_dataset
    Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 18117.94it/s]
    14: builder_instance.download_and_prepare(
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 928, in download_and_prepare
    14: with incomplete_dir(self._output_dir) as tmp_output_dir:
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/contextlib.py", line 155, in exit
    Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 72.83it/s]
    14: self.gen.throw(typ, value, traceback)
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 909, in incomplete_dir
    14: shutil.rmtree(tmp_dir)
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/shutil.py", line 738, in rmtree
    14: onerror(os.rmdir, path, sys.exc_info())
    14: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/shutil.py", line 736, in rmtree
    14: os.rmdir(path, dir_fd=dir_fd)
    14: OSError: [Errno 39] Directory not empty: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete'
    Generating train split: 20130863 examples [00:54, 368921.36 examples/s]
    9: Traceback (most recent call last):
    9: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 450, in
    9: main()
    9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
    9: return f(*args, **kwargs)
    9: ^^^^^^^^^^^^^^^^^^
    9: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 294, in main
    9: datasets = load_dataset(extension, data_files=data_files, cache_dir="/scrip_continual_pretraining/Cache_mlm")
    9: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/load.py", line 2152, in load_dataset
    9: builder_instance.download_and_prepare(
    9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 948, in download_and_prepare
    9: self._download_and_prepare(
    9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 1045, in _download_and_prepare
    9: raise OSError(
    9: OSError: Cannot find data file.
    9: Original error:
    9: [Errno 2] No such file or directory: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete/text-train-00000-00001-of-NNNNN.arrow'
    Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 17962.76it/s]
    Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 59.23it/s]
    Generating validation split: 3246757 examples [00:07, 410442.68 examples/s]
    srun: error: jean-zay-iam07: task 14: Exited with exit code 1
    srun: Terminating StepId=1741717.0
    0: slurmstepd: error: *** STEP 1741717.0 ON jean-zay-iam05 CANCELLED AT 2024-05-04T06:15:08 ***
    Generating train split: 122206 examples [00:00, 396783.29 examples/s]
    2: split: 3308013 examples [00:07, 424316.76 examples/s]
    srun: error: jean-zay-iam07: tasks 12-13,15-16: Terminated
    srun: error: jean-zay-iam05: tasks 0-2,4-5: Terminated
    srun: error: jean-zay-iam06: tasks 7-11: Terminated
    Generating train split: 489938 examples [00:01, 440569.16 examples/s]
    srun: error: jean-zay-iam07: task 17: Terminated
    srun: error: jean-zay-iam05: task 3: Terminated
    srun: error: jean-zay-iam06: task 6: Terminated
    srun: Force Terminated StepId=1741717.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

14: OSError: [Errno 39] Directory not empty: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete'
Generating train split: 20130863 examples [00:54, 368921.36 examples/s]
9: Traceback (most recent call last):
9: File "/gpfsdswork/projects/rech/khy/uvb95lb/scrip_continual_pretraining/run_mlm_wwm.py", line 450, in
9: main()
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
9: return f(*args, **kwargs)
9: ^^^^^^^^^^^^^^^^^^
9: File "/scrip_continual_pretraining/run_mlm_wwm.py", line 294, in main
9: datasets = load_dataset(extension, data_files=data_files, cache_dir="/scrip_continual_pretraining/Cache_mlm")
9: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/load.py", line 2152, in load_dataset
9: builder_instance.download_and_prepare(
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 948, in download_and_prepare
9: self._download_and_prepare(
9: File "/gpfslocalsup/pub/anaconda-py3/2023.09/envs/pytorch-gpu-2.1.1+py3.11.5/lib/python3.11/site-packages/datasets/builder.py", line 1045, in _download_and_prepare
9: raise OSError(
9: OSError: Cannot find data file.
9: Original error:
9: [Errno 2] No such file or directory: '/scrip_continual_pretraining/Cache_mlm/text/default-d0870639fca1403e/0.0.0/c4a140d10f020282918b5dd1b8a49f0104729c6177f60a6b49ec2a365ec69f34.incomplete/text-train-00000-00001-of-NNNNN.arrow'
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 17962.76it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 59.23it/s]
Generating validation split: 3246757 examples [00:07, 410442.68 examples/s]
srun: error: jean-zay-iam07: task 14: Exited with exit code 1
srun: Terminating StepId=1741717.0
0: slurmstepd: error: *** STEP 1741717.0 ON jean-zay-iam05 CANCELLED AT 2024-05-04T06:15:08 ***
Generating train split: 122206 examples [00:00, 396783.29 examples/s]
2: split: 3308013 examples [00:07, 424316.76 examples/s]
srun: error: jean-zay-iam07: tasks 12-13,15-16: Terminated
srun: error: jean-zay-iam05: tasks 0-2,4-5: Terminated

Expected behavior

I expect my training to run successfully for at least one epoch.

@amyeroberts
Collaborator

cc @muellerzr @pacman100

@muellerzr
Contributor

@yuane4 can you please share your entire script?

@yuane4
Author

yuane4 commented May 17, 2024

Yes, of course, here is the script. I made small adjustments to the original script because the server I use for training provides its own library, called idr_torch, to manage parallelisation; you can find the added lines in the main function:

import json
import logging
import math
import os
import sys
from dataclasses import dataclass, field
from typing import Optional

from datasets import Dataset, load_dataset

import transformers
from transformers import (
    CONFIG_MAPPING,
    MODEL_FOR_MASKED_LM_MAPPING,
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint, is_main_process

import torch
from torch.distributed.elastic.multiprocessing.errors import record
import torch.distributed as dist
import idr_torch

logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
    """

    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
            )
        },
    )
    model_type: Optional[str] = field(
        default=None,
        metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
    )
    config_overrides: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "Override some existing default config settings when a model is trained from scratch. Example: "
                "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
            )
        },
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )

    def __post_init__(self):
        if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
            raise ValueError(
                "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
            )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    train_ref_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input train ref data file for whole word masking in Chinese."},
    )
    validation_ref_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input validation ref data file for whole word masking in Chinese."},
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    validation_split_percentage: Optional[int] = field(
        default=5,
        metadata={
            "help": "The percentage of the train set used as validation set in case there's no validation split"
        },
    )
    max_seq_length: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated. Default to the max input length of the model."
            )
        },
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    mlm_probability: float = field(
        default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to `max_seq_length`. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch."
            )
        },
    )

    def __post_init__(self):
        if self.train_file is not None:
            extension = self.train_file.split(".")[-1]
            assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file."
        if self.validation_file is not None:
            extension = self.validation_file.split(".")[-1]
            assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file."


def add_chinese_references(dataset, ref_file):
    with open(ref_file, "r", encoding="utf-8") as f:
        refs = [json.loads(line) for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
    assert len(dataset) == len(refs)

    dataset_dict = {c: dataset[c] for c in dataset.column_names}
    dataset_dict["chinese_ref"] = refs
    return Dataset.from_dict(dataset_dict)


@record
def main():

    os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']

    dist.init_process_group(backend='nccl',
                            init_method='env://',
                            world_size=idr_torch.size,
                            rank=idr_torch.rank)

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    training_args.local_rank = idr_torch.local_rank

    # Detecting last checkpoint.
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
            raise ValueError(
                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
                "Use --overwrite_output_dir to overcome."
            )
        elif last_checkpoint is not None:
            logger.info(
                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    # Set the verbosity to info of the Transformers logger (on main process only):
    if is_main_process(training_args.local_rank):
        transformers.utils.logging.set_verbosity_info()
        transformers.utils.logging.enable_default_handler()
        transformers.utils.logging.enable_explicit_format()
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed before initializing model.
    set_seed(training_args.seed)

    # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
    # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
    # (the dataset will be downloaded automatically from the datasets Hub).
    #
    # For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
    # 'text' is found. You can easily tweak this behavior (see below).
    #
    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    # download the dataset.
    if data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir="/gpfswork/rech/khy/uvb95lb/scrip_continual_pretraining/Cache_mlm")
        if "validation" not in datasets.keys():
            datasets["validation"] = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                split=f"train[:{data_args.validation_split_percentage}%]",
            )
            datasets["train"] = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                split=f"train[{data_args.validation_split_percentage}%:]",
            )
    else:
        data_files = {}
        if data_args.train_file is not None:
            data_files["train"] = data_args.train_file
            extension = data_args.train_file.split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        datasets = load_dataset(extension, data_files=data_files, cache_dir="/gpfswork/rech/khy/uvb95lb/scrip_continual_pretraining/Cache_mlm")
    # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
    # https://huggingface.co/docs/datasets/loading_datasets.

    # Load pretrained model and tokenizer
    #
    # Distributed training:
    # The .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.
    config_kwargs = {
        "cache_dir": model_args.cache_dir,
        "revision": model_args.model_revision,
        "use_auth_token": True if model_args.use_auth_token else None,
    }
    if model_args.config_name:
        config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
    elif model_args.model_name_or_path:
        config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
    else:
        config = CONFIG_MAPPING[model_args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")
        if model_args.config_overrides is not None:
            logger.info(f"Overriding config: {model_args.config_overrides}")
            config.update_from_string(model_args.config_overrides)
            logger.info(f"New config: {config}")

    tokenizer_kwargs = {
        "cache_dir": model_args.cache_dir,
        "use_fast": model_args.use_fast_tokenizer,
        "revision": model_args.model_revision,
        "use_auth_token": True if model_args.use_auth_token else None,
    }
    if model_args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
    elif model_args.model_name_or_path:
        tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
    else:
        raise ValueError(
            "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
            "You can do it from another script, save it, and load it from here, using --tokenizer_name."
        )

    if model_args.model_name_or_path:
        model = AutoModelForMaskedLM.from_pretrained(
            model_args.model_name_or_path,
            from_tf=bool(".ckpt" in model_args.model_name_or_path),
            config=config,
            cache_dir=model_args.cache_dir,
            revision=model_args.model_revision,
            use_auth_token=True if model_args.use_auth_token else None,
            local_files_only=True,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelForMaskedLM.from_config(config)

    model.resize_token_embeddings(len(tokenizer))

    # Preprocessing the datasets.
    # First we tokenize all the texts.
    if training_args.do_train:
        column_names = datasets["train"].column_names
    else:
        column_names = datasets["validation"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]

    padding = "max_length" if data_args.pad_to_max_length else False

    def tokenize_function(examples):
        # Remove empty lines
        examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
        return tokenizer(examples["text"], padding=padding, truncation=True, max_length=data_args.max_seq_length)

    tokenized_datasets = datasets.map(
        tokenize_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=[text_column_name],
        load_from_cache_file=not data_args.overwrite_cache
    )

    # Add the chinese references if provided
    if data_args.train_ref_file is not None:
        tokenized_datasets["train"] = add_chinese_references(tokenized_datasets["train"], data_args.train_ref_file)
    if data_args.validation_ref_file is not None:
        tokenized_datasets["validation"] = add_chinese_references(
            tokenized_datasets["validation"], data_args.validation_ref_file
        )
    # If we have ref files, we need to prevent the Trainer from removing them
    has_ref = data_args.train_ref_file or data_args.validation_ref_file
    if has_ref:
        training_args.remove_unused_columns = False

    # Data collator
    # This one will take care of randomly masking the tokens.
    data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"] if training_args.do_train else None,
        eval_dataset=tokenized_datasets["validation"] if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    # Training
    if training_args.do_train:
        if last_checkpoint is not None:
            checkpoint = last_checkpoint
        elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
            checkpoint = model_args.model_name_or_path
        else:
            checkpoint = None
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
        if trainer.is_world_process_zero():
            with open(output_train_file, "w") as writer:
                logger.info("***** Train results *****")
                for key, value in sorted(train_result.metrics.items()):
                    logger.info(f"  {key} = {value}")
                    writer.write(f"{key} = {value}\n")

            # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
            trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))

    # Evaluation
    results = {}
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()

        perplexity = math.exp(eval_output["eval_loss"])
        results["perplexity"] = perplexity

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_mlm_wwm.txt")
        if trainer.is_world_process_zero():
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key, value in sorted(results.items()):
                    logger.info(f"  {key} = {value}")
                    writer.write(f"{key} = {value}\n")

    return results


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()
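For reference, here is also a rough, untested sketch of the per-process (global rank) variant applied to the dataset loading in main(), reusing the idr_torch module the script already imports; the helper name is purely illustrative:

```python
import os

import idr_torch  # cluster-provided module; exposes .rank, .local_rank and .size


def per_rank_cache_dir(base="/gpfswork/rech/khy/uvb95lb/scrip_continual_pretraining/Cache_mlm"):
    """Illustrative helper: one datasets cache directory per global rank, so that
    no two processes ever share the same ".incomplete" build directory."""
    cache_dir = os.path.join(base, f"rank_{idr_torch.rank}")
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


# In main(), the dataset loading would then become something like:
# datasets = load_dataset(extension, data_files=data_files, cache_dir=per_rank_cache_dir())
```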
