Reproduce the commonsense results on BoolQ #64

Open
Zhenyu001225 opened this issue Apr 9, 2024 · 15 comments

Zhenyu001225 commented Apr 9, 2024

When I'm running the evaluation, should I use --load_8bit? I'm trying to reproduce the LLaMA-7B-LoRA results.

Finetune:

CUDA_VISIBLE_DEVICES=8 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path './ft-training_set/commonsense_170k.json' \
  --output_dir './trained_models/llama-7b-lora-commonsense/' \
  --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 \
  --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 \
  --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' \
  --lora_r 32 --lora_alpha 64

Evaluate:

CUDA_VISIBLE_DEVICES=3 python commonsense_evaluate.py \
  --model LLaMA-7B \
  --adapter LoRA \
  --dataset boolq \
  --batch_size 1 \
  --base_model 'yahma/llama-7b-hf' \
  --lora_weights './trained_models/llama-7b-lora-commonsense/'

But my result is only 57.5, compared with 68.9 in the table.
Could you give me some insight here?
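
For reference, the finetune.py flags above roughly map onto a standard peft LoraConfig as in the sketch below; this is an assumed mapping, and the script itself may set additional defaults (e.g. lora_dropout).

```python
from peft import LoraConfig

# Hypothetical mapping of the finetune.py flags onto a standard peft LoraConfig;
# finetune.py may add its own defaults on top of these.
lora_config = LoraConfig(
    r=32,                       # --lora_r 32
    lora_alpha=64,              # --lora_alpha 64
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,          # assumed default, not set on the command line
    bias="none",
    task_type="CAUSAL_LM",
)
```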

@Zhenyu001225 (Author)

And for PIQA the result is 74.6, compared with 80.7 in the table.
For SIQA the result is 60.8, compared with 77.4 in the table.
Should I fine-tune again, or adjust any of the hyperparameters?

@lucasliunju

Hi, may I ask whether you have solved this issue now?

@wutaiqiang

BTW, I find that a larger batch size leads to some bad outputs, while bsz=1 does not.

@lucasliunju

@wutaiqiang Yes, I also ran into this problem. bsz=1 solves most cases, but it can still produce bad results in some cases.
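
One common cause of degraded outputs with batch_size > 1 on decoder-only models is right padding during batched generation; the usual mitigation looks roughly like the sketch below (this assumes the standard transformers generate API and is not necessarily what commonsense_evaluate.py does).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yahma/llama-7b-hf")
# Decoder-only models should be left-padded for batched generation;
# otherwise new tokens are appended after pad tokens and quality drops.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("yahma/llama-7b-hf", device_map="auto")

prompts = [
    "Question: Is the sky blue?\nAnswer:",
    "Question: Can a fish ride a bicycle? Answer true or false.\nAnswer:",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```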

@wutaiqiang

In my case, the results are even better than reported. You should use a single GPU for fine-tuning.

@wutaiqiang

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

@wutaiqiang

For LLaMA-7B + LoRA.

@lucasliunju

Hi @wutaiqiang, thanks for your data point. I tried changing the base model dtype from "float16" to "float32" or "bfloat16", and I find the output results are not very stable.
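
For anyone comparing dtypes, the load-time switch is just the torch_dtype argument; a minimal sketch, assuming the usual transformers loading path rather than the repo's exact code:

```python
import torch
from transformers import AutoModelForCausalLM

# Compare load dtypes one at a time; float16 is the usual default here,
# float32 doubles memory, bfloat16 needs recent GPU support.
dtype = torch.float16  # try torch.bfloat16 or torch.float32 to reproduce the instability
model = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    torch_dtype=dtype,
    device_map="auto",
)
```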

@Zhenyu001225 (Author)

> Hi, may I ask whether you have solved this issue now?

Hi, I changed the version of transformers to 4.35.0 and used batch_size=1 when doing evaluation.

Now the results are:

Model              | GSM8K | SVAMP | AQuA  | MultiArith | SingleEq | AddSub
LLaMA-7B-LoRA-math | 37.9  | 47.0  | 19.68 | 97.5       | 85.83    | 83.54

Model                     | BoolQ | PIQA  | SIQA  | HellaSwag | WinoGrande | ARC-c | ARC-e | OpenBookQA | Average
LLaMA-7B-LoRA-commonsense | 64.01 | 80.25 | 77.28 | 76.50     | 79.79      | 62.54 | 77.31 | 77.4       | 74.39
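
Since these numbers turned out to depend on the transformers version and on batch_size=1, a small, hypothetical guard like the following can catch a mismatched environment before a long evaluation run:

```python
import transformers
from packaging import version

# The results above were obtained with transformers 4.35.0 and batch_size=1;
# warn if the installed version differs (illustrative check, not part of the repo).
EXPECTED = "4.35.0"
if version.parse(transformers.__version__) != version.parse(EXPECTED):
    print(
        f"Warning: results were reproduced with transformers {EXPECTED}, "
        f"but {transformers.__version__} is installed."
    )
```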

@Zhenyu001225 (Author)

> For LLaMA-7B + LoRA.

Hi, what is the version of transformers in your case?

@wutaiqiang

4.32.1

@Zhenyu001225 (Author)

> 4.32.1

Thank you so much~ I'll try again

@clarenceluo78

> boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
> 69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask whether you used 8-bit quantization when reproducing?

@Zhenyu001225 (Author)

> boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
> 69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8
>
> Hi there, I want to ask whether you used 8-bit quantization when reproducing?

I didn't enable 8-bit quantization.
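
For context, a --load_8bit flag in scripts like this typically toggles bitsandbytes 8-bit loading at model-load time, roughly as in the sketch below (an assumption about the loading path; the repo's code may differ). The runs above were done with it off.

```python
import torch
from transformers import AutoModelForCausalLM

load_8bit = False  # the numbers above were reproduced without 8-bit quantization

# When enabled, weights are loaded through bitsandbytes in int8;
# otherwise the model is loaded in fp16.
model = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    load_in_8bit=load_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
)
```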

@wutaiqiang

After rerunning, the results are:

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
68.13 | 80.3 | 78.45       | 83.11     | 80.66      | 77.23    | 65.78         | 79.4
