
Questions about evaluate time #40

Open
Yuan0320 opened this issue Sep 16, 2023 · 4 comments

Comments

@Yuan0320 commented Sep 16, 2023

Hi @HZQ950419, thanks for your great work! I wonder how long your evaluation phase took. I used a single V100, and evaluation seems quite time-consuming; for example, the AddSub test set took me about 5 hours. Is this normal?

  0%|                                                                                                                                                                                                                                                     | 0/395 [00:00<?, ?it/s]
---------------
A: There were 7 crayons in the drawer. Mary took 3 out, so now there are 7 - 3 = 4 crayons in the drawer. The answer is 4.<unk>� (Note: The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 100 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 1000 crayons in the drawer, the answer would be 7 - 3 = 4. In general, the answer is 7 - x = 4, where x is the number of crayons in the drawer. The answer is 4 in this case because there are 4 crayons in the drawer. The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 c
prediction: 10.0
label: 4.0
---------------
test:1/395 | accuracy 0  0.0
  0%|▌                                                                                                                                                                                                                                          | 1/395 [00:54<5:58:23, 54.58s/it]
---------------
A: There are 3 gallons in the bucket. Derek adds 6.8 gallons more. So in total there are 3 + 6.8 = 9.8 gallons. The answer is 9.8 gallons.<unk>
<unk>]{' Instruction:', 'Response:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:',
prediction: 9.8
label: 9.8
---------------
test:2/395 | accuracy 1  0.5
  1%|█▏                                                                                                                                                                                                                                         | 2/395 [01:46<5:46:58, 52.97s/it]
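(Aside: the mismatch in the first example above, where the generation says "The answer is 4" but the logged prediction is 10.0, looks like what a last-number answer extractor would produce on a generation that is cut off at "If there were 10 c". Below is a minimal sketch of that kind of extractor; it is an assumption on my part and may not match the repo's actual evaluate.py logic.)

```python
import re

def extract_prediction(generation: str) -> float:
    """Return the last number mentioned in a generation, or NaN if none.

    Heuristic answer extraction like this is common in math-reasoning eval
    scripts; when the model keeps generating past its answer and the output
    is truncated mid-sentence, the last number can be the wrong one.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(numbers[-1]) if numbers else float("nan")

# The first generation above is truncated at "... If there were 10 c",
# so the last number found is "10" -> prediction 10.0, while the label is 4.0.
```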
@HZQ950419 (Collaborator)

Hi,

Yes, the evaluation will take a very long time. GSM8K takes around 20+ hours on a V100 32G without load_8bit.
Please let us know if you have further questions!
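For reference, 8-bit loading with transformers + peft looks roughly like the sketch below. The adapter path is a placeholder for your LoRA output_dir, and this is the generic bitsandbytes int8 path, not necessarily identical to what evaluate.py does with load_8bit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "yahma/llama-7b-hf"                       # same base model as in the training command
adapter_path = "./trained_models/llama-7b-lora-math/"  # placeholder: your LoRA output_dir

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,           # int8 weights via bitsandbytes; roughly halves weight memory vs fp16
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA adapter
model.eval()
```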

@Yuan0320 (Author) commented Sep 18, 2023

Thanks for the response! I ran the experiments and found that for some datasets the evaluation accuracy differs somewhat from the results reported in the paper (e.g., AddSub 86.58 vs. 78.5). I trained using math_data.json. Do you know why? Maybe the environment or GPUs? I'm not sure if I'm going wrong somewhere.

| Model | MultiArith | GSM8K | AddSub | AQuA | SingleEq | SVAMP |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B + LoRA (paper reported) | 88.3 | 21.9 | 78.5 | 27.5 | 83.3 | 54.5 |
| LLaMA-7B + LoRA (my reproduced) | 85.83 | 25.85 | 86.6 | 17.32 | 84.65 | 65.4 |

@HZQ950419 (Collaborator)

Hi,

Please use the following command to reproduce the result for LLaMA-7B-LoRA.

CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path 'math_10k.json' --output_dir './trained_models/llama-7b-lora-math/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

@Yuan0320 (Author)

Thanks! This helps me a lot. Currently the training dataset, e.g. math_data.json, covers all six math reasoning datasets. I wonder whether we can get the training data for each dataset separately, or whether there is some other way to tell which dataset each training sample belongs to.
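If each record carried a key such as "source" naming its origin dataset, a split like the sketch below would work; that key name is purely hypothetical on my side, and I could not find such a field, which is why I am asking:

```python
import json
from collections import defaultdict

# math_data.json is the merged training file; the "source" key is hypothetical.
with open("math_data.json") as f:
    records = json.load(f)

by_dataset = defaultdict(list)
for record in records:
    by_dataset[record.get("source", "unknown")].append(record)

# Write one file per origin dataset.
for name, subset in by_dataset.items():
    with open(f"math_data_{name}.json", "w") as f:
        json.dump(subset, f, indent=2)
```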
