
Questions about evaluate time #40

Open
Yuan0320 opened this issue Sep 16, 2023 · 4 comments

Comments

@Yuan0320 commented Sep 16, 2023

Hi @HZQ950419, thanks for your great work! I wonder how long your evaluation phase took. I used a single V100, and evaluation seems quite time-consuming; for example, the AddSub test set took me about 5 hours. Is this normal?

  0%|                                                                                                                                                                                                                                                     | 0/395 [00:00<?, ?it/s]
---------------
A: There were 7 crayons in the drawer. Mary took 3 out, so now there are 7 - 3 = 4 crayons in the drawer. The answer is 4.<unk>� (Note: The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 100 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 1000 crayons in the drawer, the answer would be 7 - 3 = 4. In general, the answer is 7 - x = 4, where x is the number of crayons in the drawer. The answer is 4 in this case because there are 4 crayons in the drawer. The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 c
prediction: 10.0
label: 4.0
---------------
test:1/395 | accuracy 0  0.0
  0%|▌                                                                                                                                                                                                                                          | 1/395 [00:54<5:58:23, 54.58s/it]
---------------
A: There are 3 gallons in the bucket. Derek adds 6.8 gallons more. So in total there are 3 + 6.8 = 9.8 gallons. The answer is 9.8 gallons.<unk>
<unk>]{' Instruction:', 'Response:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:',
prediction: 9.8
label: 9.8
---------------
test:2/395 | accuracy 1  0.5
  1%|█▏                                                                                                                                                                                                                                         | 2/395 [01:46<5:46:58, 52.97s/it]
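(Aside: the mismatch in the first example above, where the generation says "The answer is 4" but the logged prediction is 10.0, looks like what a last-number answer extractor would produce on a generation that is cut off at "If there were 10 c". Below is a minimal sketch of that kind of extractor; it is an assumption on my part and may not match the repo's actual evaluate.py logic.)

```python
import re

def extract_prediction(generation: str) -> float:
    """Return the last number mentioned in a generation, or NaN if none.

    Heuristic answer extraction like this is common in math-reasoning eval
    scripts; when the model keeps generating past its answer and the output
    is truncated mid-sentence, the last number can be the wrong one.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(numbers[-1]) if numbers else float("nan")

# The first generation above is truncated at "... If there were 10 c",
# so the last number found is "10" -> prediction 10.0, while the label is 4.0.
```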
@HZQ950419 (Collaborator)

Hi,

Yes, the evaluation will take a very long time. GSM8K takes around 20+ hours on a V100 32G without load_8bit.
Please let us know if you have further questions!
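For reference, 8-bit loading with transformers + peft looks roughly like the sketch below. The adapter path is a placeholder for your LoRA output_dir, and this is the generic bitsandbytes int8 path, not necessarily identical to what evaluate.py does with load_8bit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "yahma/llama-7b-hf"                       # same base model as in the training command
adapter_path = "./trained_models/llama-7b-lora-math/"  # placeholder: your LoRA output_dir

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,           # int8 weights via bitsandbytes; roughly halves weight memory vs fp16
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA adapter
model.eval()
```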

@Yuan0320 (Author) commented Sep 18, 2023

Thanks for the response! I ran the experiments and found that for some datasets the evaluation accuracy differs somewhat from the results reported in the paper (e.g., AddSub 86.58 vs. 78.5). I trained using math_data.json. Do you know why? Maybe the environment or GPUs? I'm not sure if I'm going wrong somewhere.

| Model | MultiArith | GSM8K | AddSub | AQuA | SingleEq | SVAMP |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B + LoRA (paper reported) | 88.3 | 21.9 | 78.5 | 27.5 | 83.3 | 54.5 |
| LLaMA-7B + LoRA (my reproduced) | 85.83 | 25.85 | 86.6 | 17.32 | 84.65 | 65.4 |

@HZQ950419 (Collaborator)

Hi,

Please use the following command to reproduce the result for LLaMA-7B-LoRA.

CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path 'math_10k.json' --output_dir './trained_models/llama-7b-lora-math/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

@Yuan0320 (Author)

Thanks! This helps me a lot. Currently the training dataset, e.g. math_data.json, covers all six math reasoning datasets. I wonder whether we can get the training data for each dataset separately, or whether there is some other way to tell which dataset each training sample belongs to.
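If each record carried a key such as "source" naming its origin dataset, a split like the sketch below would work; that key name is purely hypothetical on my side, and I could not find such a field, which is why I am asking:

```python
import json
from collections import defaultdict

# math_data.json is the merged training file; the "source" key is hypothetical.
with open("math_data.json") as f:
    records = json.load(f)

by_dataset = defaultdict(list)
for record in records:
    by_dataset[record.get("source", "unknown")].append(record)

# Write one file per origin dataset.
for name, subset in by_dataset.items():
    with open(f"math_data_{name}.json", "w") as f:
        json.dump(subset, f, indent=2)
```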
