Training bug for 13b, 30b, and 65b #285

alexgshaw opened this issue Jun 16, 2023 · 5 comments

Has anyone been able to finetune any of the models larger than 7b successfully? I'm training on 8 A100s with 80GB of memory each, which should be more than enough.

The problem I'm running into is that the first logged loss is massive (~1e5) and every subsequent loss is 0. I'm not sure what is causing this or how to fix it, as the 7b model trains fine. I'm training with the deepspeed launcher.

Here's an example of the output when training the 65b model.

  0%|          | 0/82 [00:00<?, ?it/s]
  1%|          | 1/82 [02:38<3:34:18, 158.74s/it]
  2%|▏         | 2/82 [04:39<3:01:49, 136.37s/it]
  4%|▎         | 3/82 [06:40<2:50:13, 129.28s/it]
  5%|▍         | 4/82 [08:39<2:43:07, 125.48s/it]
  6%|▌         | 5/82 [10:39<2:38:05, 123.19s/it]
  7%|▋         | 6/82 [12:39<2:34:54, 122.30s/it]
  9%|▊         | 7/82 [14:39<2:31:49, 121.46s/it]
 10%|▉         | 8/82 [16:38<2:28:55, 120.75s/it]
 11%|█         | 9/82 [18:38<2:26:39, 120.54s/it]
 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]
                                                  
{'loss': 121486.8, 'learning_rate': 0.0, 'epoch': 0.02}

 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]
 13%|█▎        | 11/82 [22:38<2:22:06, 120.10s/it]
 15%|█▍        | 12/82 [24:38<2:20:08, 120.12s/it]
 16%|█▌        | 13/82 [26:38<2:18:11, 120.17s/it]
 17%|█▋        | 14/82 [28:38<2:16:09, 120.15s/it]
 18%|█▊        | 15/82 [30:39<2:14:13, 120.21s/it]
 20%|█▉        | 16/82 [32:39<2:12:09, 120.15s/it]
 21%|██        | 17/82 [34:38<2:09:52, 119.89s/it]
 22%|██▏       | 18/82 [36:37<2:07:41, 119.71s/it]
 23%|██▎       | 19/82 [38:36<2:05:26, 119.47s/it]
 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]
                                                  
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}

 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]
 26%|██▌       | 21/82 [42:42<2:03:38, 121.61s/it]
 27%|██▋       | 22/82 [44:43<2:01:21, 121.36s/it]
 28%|██▊       | 23/82 [46:42<1:58:48, 120.81s/it]
 29%|██▉       | 24/82 [48:42<1:56:19, 120.34s/it]
 30%|███       | 25/82 [50:41<1:54:01, 120.03s/it]
 32%|███▏      | 26/82 [52:39<1:51:33, 119.53s/it]
 33%|███▎      | 27/82 [54:39<1:49:29, 119.44s/it]
 34%|███▍      | 28/82 [56:38<1:47:29, 119.43s/it]
 35%|███▌      | 29/82 [58:37<1:45:17, 119.20s/it]
 37%|███▋      | 30/82 [1:00:37<1:43:29, 119.42s/it]
                                                    
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.07}

@alexgshaw (Author)

My finetuning arguments:

    --model_name_or_path /home/ashaw8/compute/$MODEL_DIR/$MODEL_NAME \
    --data_path ./alpaca_data.json \
    --run_name $RUN_NAME \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --logging_dir $LOGGING_DIR \
    --num_train_epochs 0.2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy no \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --warmup_steps 50 \
    --lr_scheduler_type linear \
    --weight_decay 0.1 \
    --deepspeed ./configs/default_offload_opt_param.json \
    --tf32 True \
    --logging_strategy steps \
    --logging_steps 10 \
    --report_to wandb \
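
The contents of default_offload_opt_param.json aren't shown in this thread; the sketch below is a typical ZeRO-3 CPU-offload DeepSpeed config of the kind that flag usually points at, not necessarily the repo's exact file. If it's more convenient, the Hugging Face Trainer also accepts the same config as a Python dict via TrainingArguments(deepspeed=...).

    # Sketch of a ZeRO-3 CPU-offload DeepSpeed config (assumed, not the
    # repo's actual default_offload_opt_param.json). "auto" lets the HF
    # Trainer fill in values from its own TrainingArguments.
    ds_config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }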

yh0903 commented Jun 23, 2023

Same here. Could you share how you solved this eventually? Thanks.

yxchng commented Jun 25, 2023

Are you able to train with batch size 4 as in the README?

@alexgshaw (Author)

Haven't solved it yet, but switching from the Hugging Face Trainer to PyTorch Lightning might solve the issue. If I can get it to work, I'll post a link to a repo with everything set up.

Also, I switched to a different machine with V100s instead of A100s, and 13b works there. It could also be a version difference, because I can work with Docker containers on the V100 machine but only with venvs on the A100 machine (the admins are stingy about root access).

Also, yes, I'm able to train with a batch size of 4, but that does not make a difference.
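
A minimal sketch of the PyTorch Lightning route mentioned above could look like the following. The model path, the train_loader, and the Trainer settings are placeholders; the optimizer hyperparameters just mirror the arguments posted earlier.

    import torch
    import pytorch_lightning as pl
    from transformers import AutoModelForCausalLM

    class CausalLMFinetuner(pl.LightningModule):
        """Bare-bones wrapper around a Hugging Face causal LM."""

        def __init__(self, model_name_or_path, lr=1e-5):
            super().__init__()
            self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
            self.lr = lr

        def training_step(self, batch, batch_idx):
            # batch is a dict with input_ids, attention_mask, and labels
            out = self.model(**batch)
            self.log("train_loss", out.loss, prog_bar=True)
            return out.loss

        def configure_optimizers(self):
            return torch.optim.AdamW(
                self.parameters(), lr=self.lr, betas=(0.9, 0.95), weight_decay=0.1
            )

    # Hypothetical usage; train_loader is a DataLoader you build yourself,
    # and the precision string assumes Lightning 2.x.
    # trainer = pl.Trainer(accelerator="gpu", devices=8,
    #                      strategy="deepspeed_stage_3", precision="bf16-mixed")
    # trainer.fit(CausalLMFinetuner("/path/to/llama-13b"), train_loader)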

@alexgshaw (Author)

It seems like this might be a related issue:

huggingface/transformers#14531

I turned off bf16 and that fixed my issue with 13b and 30b. Without bf16 I can't fit 65b onto my GPUs, so I haven't tested that one yet.

Any idea why bf16 is causing this problem? I think it's preventing the optimizer from stepping, but I have no idea why.
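
For anyone hitting the same behavior, here is a sketch of the workaround described above: disable bf16 and fall back to fp32, keeping TF32 matmuls for some of the speed on A100s. The output_dir value is a placeholder; the remaining values just mirror the arguments posted earlier in this thread.

    from transformers import TrainingArguments

    # Workaround sketch: bf16 off (the change that fixed 13b/30b here),
    # TF32 matmuls left on for Ampere GPUs. output_dir is a placeholder.
    args = TrainingArguments(
        output_dir="./output-13b",
        bf16=False,
        tf32=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        adam_beta1=0.9,
        adam_beta2=0.95,
        weight_decay=0.1,
        warmup_steps=50,
        lr_scheduler_type="linear",
        deepspeed="./configs/default_offload_opt_param.json",
    )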
