Training bug for 13b, 30b, and 65b #285

alexgshaw opened this issue Jun 16, 2023 · 5 comments

Has anyone been able to finetune any of the models larger than 7b successfully? I'm training on 8 A100s with 80GB of memory each, which should be more than enough.

The problem I'm running into is that the first logged loss is massive (~1e5) and every subsequent loss is 0. I'm not sure what is causing this or how to fix it, as the 7b model trains fine. I'm training with the deepspeed launcher.

Here's an example of the output when training the 65b model.

  0%|          | 0/82 [00:00<?, ?it/s]
  1%|          | 1/82 [02:38<3:34:18, 158.74s/it]
  2%|▏         | 2/82 [04:39<3:01:49, 136.37s/it]
  4%|▎         | 3/82 [06:40<2:50:13, 129.28s/it]
  5%|▍         | 4/82 [08:39<2:43:07, 125.48s/it]
  6%|▌         | 5/82 [10:39<2:38:05, 123.19s/it]
  7%|▋         | 6/82 [12:39<2:34:54, 122.30s/it]
  9%|▊         | 7/82 [14:39<2:31:49, 121.46s/it]
 10%|▉         | 8/82 [16:38<2:28:55, 120.75s/it]
 11%|█         | 9/82 [18:38<2:26:39, 120.54s/it]
 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]
                                                  
{'loss': 121486.8, 'learning_rate': 0.0, 'epoch': 0.02}

 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]
 13%|█▎        | 11/82 [22:38<2:22:06, 120.10s/it]
 15%|█▍        | 12/82 [24:38<2:20:08, 120.12s/it]
 16%|█▌        | 13/82 [26:38<2:18:11, 120.17s/it]
 17%|█▋        | 14/82 [28:38<2:16:09, 120.15s/it]
 18%|█▊        | 15/82 [30:39<2:14:13, 120.21s/it]
 20%|█▉        | 16/82 [32:39<2:12:09, 120.15s/it]
 21%|██        | 17/82 [34:38<2:09:52, 119.89s/it]
 22%|██▏       | 18/82 [36:37<2:07:41, 119.71s/it]
 23%|██▎       | 19/82 [38:36<2:05:26, 119.47s/it]
 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]
                                                  
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}

 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]
 26%|██▌       | 21/82 [42:42<2:03:38, 121.61s/it]
 27%|██▋       | 22/82 [44:43<2:01:21, 121.36s/it]
 28%|██▊       | 23/82 [46:42<1:58:48, 120.81s/it]
 29%|██▉       | 24/82 [48:42<1:56:19, 120.34s/it]
 30%|███       | 25/82 [50:41<1:54:01, 120.03s/it]
 32%|███▏      | 26/82 [52:39<1:51:33, 119.53s/it]
 33%|███▎      | 27/82 [54:39<1:49:29, 119.44s/it]
 34%|███▍      | 28/82 [56:38<1:47:29, 119.43s/it]
 35%|███▌      | 29/82 [58:37<1:45:17, 119.20s/it]
 37%|███▋      | 30/82 [1:00:37<1:43:29, 119.42s/it]
                                                    
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.07}

@alexgshaw (Author)

My finetuning arguments:

    --model_name_or_path /home/ashaw8/compute/$MODEL_DIR/$MODEL_NAME \
    --data_path ./alpaca_data.json \
    --run_name $RUN_NAME \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --logging_dir $LOGGING_DIR \
    --num_train_epochs 0.2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy no \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --warmup_steps 50 \
    --lr_scheduler_type linear \
    --weight_decay 0.1 \
    --deepspeed ./configs/default_offload_opt_param.json \
    --tf32 True \
    --logging_strategy steps \
    --logging_steps 10 \
    --report_to wandb \
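
The contents of default_offload_opt_param.json aren't shown in this thread; the sketch below is a typical ZeRO-3 CPU-offload DeepSpeed config of the kind that flag usually points at, not necessarily the repo's exact file. If it's more convenient, the Hugging Face Trainer also accepts the same config as a Python dict via TrainingArguments(deepspeed=...).

    # Sketch of a ZeRO-3 CPU-offload DeepSpeed config (assumed, not the
    # repo's actual default_offload_opt_param.json). "auto" lets the HF
    # Trainer fill in values from its own TrainingArguments.
    ds_config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }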

yh0903 commented Jun 23, 2023

Same here. Could you share how you solved this eventually? Thanks.

yxchng commented Jun 25, 2023

Are you able to train with batch size 4 as in the README?

@alexgshaw (Author)

Haven't solved it yet, but switching from the Hugging Face Trainer to PyTorch Lightning might solve the issue. If I can get it to work, I'll post a link to a repo with everything set up.

Also, I switched to a different machine with V100s instead of A100s, and 13b works there. It could also be a version difference, because I can work with Docker containers on the V100 machine but only with venvs on the A100 machine (the admins are stingy about root access).

Also, yes, I'm able to train with a batch size of 4, but that does not make a difference.
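
A minimal sketch of the PyTorch Lightning route mentioned above could look like the following. The model path, the train_loader, and the Trainer settings are placeholders; the optimizer hyperparameters just mirror the arguments posted earlier.

    import torch
    import pytorch_lightning as pl
    from transformers import AutoModelForCausalLM

    class CausalLMFinetuner(pl.LightningModule):
        """Bare-bones wrapper around a Hugging Face causal LM."""

        def __init__(self, model_name_or_path, lr=1e-5):
            super().__init__()
            self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
            self.lr = lr

        def training_step(self, batch, batch_idx):
            # batch is a dict with input_ids, attention_mask, and labels
            out = self.model(**batch)
            self.log("train_loss", out.loss, prog_bar=True)
            return out.loss

        def configure_optimizers(self):
            return torch.optim.AdamW(
                self.parameters(), lr=self.lr, betas=(0.9, 0.95), weight_decay=0.1
            )

    # Hypothetical usage; train_loader is a DataLoader you build yourself,
    # and the precision string assumes Lightning 2.x.
    # trainer = pl.Trainer(accelerator="gpu", devices=8,
    #                      strategy="deepspeed_stage_3", precision="bf16-mixed")
    # trainer.fit(CausalLMFinetuner("/path/to/llama-13b"), train_loader)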

@alexgshaw (Author)

It seems like this might be a related issue:

huggingface/transformers#14531

I turned off bf16 and that fixed my issue with 13b and 30b. Without bf16 I can't fit 65b onto my GPUs, so I haven't tested that one yet.

Any idea why bf16 is causing this problem? I think it's preventing the optimizer from stepping, but I have no idea why.
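
For anyone hitting the same behavior, here is a sketch of the workaround described above: disable bf16 and fall back to fp32, keeping TF32 matmuls for some of the speed on A100s. The output_dir value is a placeholder; the remaining values just mirror the arguments posted earlier in this thread.

    from transformers import TrainingArguments

    # Workaround sketch: bf16 off (the change that fixed 13b/30b here),
    # TF32 matmuls left on for Ampere GPUs. output_dir is a placeholder.
    args = TrainingArguments(
        output_dir="./output-13b",
        bf16=False,
        tf32=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        adam_beta1=0.9,
        adam_beta2=0.95,
        weight_decay=0.1,
        warmup_steps=50,
        lr_scheduler_type="linear",
        deepspeed="./configs/default_offload_opt_param.json",
    )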
