
Question about the reproduction of the results on math_10k #58

Open
zeyuliu1037 opened this issue Feb 29, 2024 · 13 comments

@zeyuliu1037

Hi, thank you for your awesome work!

I have one question about the training on the math_10k dataset.
python finetune.py --base_model 'yahma/llama-7b-hf' --data_path 'ft-training_set/math_10k.json' --output_dir './trained_models/llama-7b-lora-math/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

However, I only get 16.14 on AQuA and 46.9 on SVAMP, whereas the table reports 18.9 on AQuA and 52.1 on SVAMP.
I'm using the peft library from the GitHub repo. Do you have any insights on this? I also noticed that even with "load_best_model_at_end=True", the best model does not seem to be loaded at the end: based on the wandb output, the final eval_loss is still that of the last checkpoint. Is this correct?

Thank you so much in advance.
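(For reference, a minimal sketch, not from this repo, of the standard Hugging Face TrainingArguments settings that load_best_model_at_end depends on, with the step values copied from the command above. Note that the best checkpoint is reloaded only after the final evaluation, so the last eval_loss logged to wandb is still that of the last checkpoint; it is not re-computed for the reloaded best model.)

from transformers import TrainingArguments

# Sketch only: the argument names are the standard Trainer API, the values
# mirror the finetune.py flags above.
training_args = TrainingArguments(
    output_dir="./trained_models/llama-7b-lora-math/",
    evaluation_strategy="steps",        # must match save_strategy
    eval_steps=80,
    save_strategy="steps",
    save_steps=80,                      # keep this a multiple of eval_steps
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # criterion for picking the "best" checkpoint
    greater_is_better=False,
)
# The best checkpoint is reloaded after training finishes; no extra
# evaluation is run on it, so wandb's final eval_loss stays unchanged.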

@HZQ950419
Collaborator

Hi,

Can I ask whether you used multiple GPUs for training? If so, please try again with a single GPU.

@zeyuliu1037
Author

I use a single GPU.

@Zhenyu001225


Hi, did you solve this problem? My results are close to yours.

@zeyuliu1037
Author


Unfortunately, I haven't solved it yet.

@Zhenyu001225

You can use transformers==4.35.0; the results should then be close to the authors'.

@zeyuliu1037
Author

Thank you so much!!!

@Aradhye2002

@Zhenyu001225 any idea why this happens? An extreme case is transformers 4.40.0, which gave me gibberish output, as mentioned in this issue.

Thanks

@Zhenyu001225

I think it's because of the tokenizer version.
For math, you can try:

CUDA_VISIBLE_DEVICES=1 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path './ft-training_set/math_10k.json' \
  --output_dir './trained_models/llama-7b-lora-math/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 0 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora \
  --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' \
  --lora_r 32 \
  --lora_alpha 64

For commonsense:

CUDA_VISIBLE_DEVICES=8 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path 'ft-training_set/commonsense_170k.json' \
  --output_dir './trained_models/llama-7b-lora-commonsense/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 120 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora \
  --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' \
  --lora_r 32 \
  --lora_alpha 64

@zeyuliu1037
Author

Hi, could you kindly share your requirements.txt with pinned versions? I think that besides the transformers version, the versions of accelerate and tokenizers also affect the results. Thank you so much!
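(A small generic sketch, not a file from the repo, for capturing the exact versions in play, assuming transformers, tokenizers, accelerate, and peft are installed:)

import accelerate
import peft
import tokenizers
import transformers

# Print pinned-style lines that can be pasted into a requirements file.
for mod in (transformers, tokenizers, accelerate, peft):
    print(f"{mod.__name__}=={mod.__version__}")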

@ZeguanXiao

ZeguanXiao commented May 7, 2024

@Zhenyu001225 When switching to transformers 4.35.0, training is very unstable: the training loss goes to 0 and the validation loss goes to NaN. Do you have the same problem?
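(A diagnostic sketch, not part of the repo: a callback built on the standard Hugging Face TrainerCallback API that stops the run as soon as a logged training or eval loss becomes NaN, so the unstable run fails fast instead of collapsing to zero loss. The class name is made up; pass it via Trainer(callbacks=[StopOnNanLoss()]).)

import math

from transformers import TrainerCallback


class StopOnNanLoss(TrainerCallback):
    # Hypothetical helper: halts training when any logged loss is NaN.
    def on_log(self, args, state, control, logs=None, **kwargs):
        for key in ("loss", "eval_loss"):
            value = (logs or {}).get(key)
            if value is not None and math.isnan(value):
                print(f"{key} is NaN at step {state.global_step}; stopping training.")
                control.should_training_stop = True
        return control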

@YYing0111

Hi, I have the same problem. Did you solve it?

@ZeguanXiao

ZeguanXiao commented May 31, 2024

@YYing0111 Try installing transformers with
git+https://github.com/yizhongw/transformers.git@left_padding

@pkhanna7

pkhanna7 commented Jun 2, 2024

Hi, I finetuned the Llama-7B model using LoRA with math_10k on a single A100 GPU with transformers==4.35.0, but still got a much lower accuracy (27.2%) on SVAMP compared to the reported numbers (52.1%).

From a manual analysis of the generated responses, it seems that the model generates a lot of irrelevant code after finishing its reasoning steps. The final answer for the math datasets is taken to be the last float number present in the response, so when random code is appended at the end, the numeric answer is fetched from the gibberish text instead of from the actual answer, which drops the accuracy.

Here's an example:

  1. Add the number of pages of math homework and reading homework: 5 + 2 = 7
  2. Subtract the number of pages of reading homework from the number of pages of math homework: 5 - 2 = 3
  3. The difference between the number of pages of math homework and reading homework is 3.

Therefore, Rachel had 3 more pages of math homework than reading homework. The answer in Arabic numerals is 3.
</s><s>Tags: javascript, jquery, html, css, twitter-bootstrap

Question: How to add a class to an element when another element's value changes?

I'm trying to add a class to an element when another element's value changes. Here's an example of what I'm trying to do:

\begin{code}
input type="text" class="form-control" id="amount" value="100"
input type="text" class="form-control" id="amount" value="200"
input type="text" class="form-control" id="amount" value="300"
prediction: 300.0
label: 3.0

Here it treats 300 as the answer since that's the last number in the generated response, even though Llama's actual reasoning in the first half of the generation is correct. Does anyone know how to fix this? Thanks!
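(For what it's worth, a minimal sketch of the failure mode, not the repo's actual evaluation code: extracting the last number from the raw generation picks up 300 from the gibberish, while truncating at the first "</s>" before extracting recovers the intended 3. The function name and regex here are my own.)

import re


def extract_last_number(text):
    # Grab every integer/float in the text and return the last one.
    nums = re.findall(r"-?\d+\.?\d*", text)
    return float(nums[-1]) if nums else None


generation = (
    'The answer in Arabic numerals is 3.\n'
    '</s><s>Tags: javascript, jquery ... id="amount" value="300"'
)

print(extract_last_number(generation))                   # 300.0 (wrong)
print(extract_last_number(generation.split("</s>")[0]))  # 3.0 (intended)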

Edit: also, here's my fine-tuning command:
CUDA_VISIBLE_DEVICES=7 python finetune.py > finetune_llama7_singlegpu_old_transformers.txt --base_model 'yahma/llama-7b-hf' --data_path 'ft-training_set/math_10k.json' --output_dir './trained_models/llama-7b-lora-math-single-gpu-old-transformers/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64
