
torchrun breaks with load_best_model_at_end and with metric_for_best_model=eval_f1 on question_answering example #30819

godspeed5 opened this issue May 15, 2024 · 0 comments
System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes (DDP via torchrun; single node, multi-GPU)

Who can help?

@muellerzr @pacman100 @ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. I clone the main branch of transformers and run pip install -e . in the cloned transformers folder.
  2. I then run torchrun --nproc_per_node 2 run_qa.py --model_name_or_path google-bert/bert-base-uncased --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad --max_steps 20 --eval_steps 2 --save_steps 2 --save_total_limit 2 --load_best_model_at_end True --metric_for_best_model eval_f1 --max_eval_samples 20 --eval_strategy steps --save_strategy steps 2>&1 | tee scratch.log
  3. The run errors out with KeyError: 'eval_f1'.
  4. I believe this happens because the compute_metrics function computes the eval_f1 metric on only one process, while the trainer's _save_checkpoint() method checks for the metric on every process, so a rank other than process 0 reaches the check without the key in its metrics dict and fails with this error (a sketch of one possible workaround follows below this list):
    if metrics is not None and self.args.metric_for_best_model is not None:
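
For context, a minimal sketch (my assumption, not the actual transformers fix) of one way to avoid the mismatch: broadcast the metrics dict computed on rank 0 to every rank before the trainer consumes it, assuming torch.distributed has been initialized by torchrun. The helper name broadcast_metrics is hypothetical.

    import torch.distributed as dist

    def broadcast_metrics(metrics):
        # Hypothetical helper: make rank 0's metrics dict visible on every rank,
        # so a later lookup of metrics["eval_f1"] cannot fail on non-zero ranks.
        if not (dist.is_available() and dist.is_initialized()):
            return metrics
        obj_list = [metrics if dist.get_rank() == 0 else None]
        dist.broadcast_object_list(obj_list, src=0)  # rank 0 supplies the object
        return obj_list[0]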

Expected behavior

Ideally, the script should run seamlessly under torchrun with no KeyError. The trainer should be able to handle metrics such as eval_f1 that are computed on a single process, alongside the multi-process metric computation used in other workloads such as summarization.
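
As an illustration of the expected behavior, a hedged sketch of the kind of guard that would avoid the crash (this is not the actual Trainer implementation; names such as maybe_update_best_metric, state, and args are hypothetical stand-ins):

    def maybe_update_best_metric(state, args, metrics, greater_is_better=True):
        # Skip silently when this rank never computed the metric, instead of
        # raising KeyError as in the check quoted in the reproduction above.
        if metrics is None or args.metric_for_best_model is None:
            return
        key = args.metric_for_best_model
        if not key.startswith("eval_"):
            key = f"eval_{key}"
        value = metrics.get(key)
        if value is None:
            return
        if state.best_metric is None or (value > state.best_metric) == greater_is_better:
            state.best_metric = value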
