
Reproducing the OctoCoder model #17

Open
mstallone opened this issue Sep 1, 2023 · 12 comments

@mstallone commented Sep 1, 2023

Hello, I have a few questions about OctoCoder.

For this part in the paper:

For instruction tuning our models, we select 5,000 random samples from COMMITPACKFT across the 6 programming languages that we evaluate on.

Could you please provide the exact training data and the launch script to fine-tune StarCoder into OctoCoder?

Or, the seeds that you used for selecting 5,000 instructions from CommitPackFT?

For a second question: were OctoCoder and the results in the paper produced using finetuning/starcoder/finetune.py with LoRA/PEFT?

Thanks!

Btw, fantastic results @Muennighoff and team :)

@Muennighoff (Collaborator)

I think this is the exact dataset we used for OctoCoder: https://huggingface.co/datasets/bigcode/guanaco-commits

Yes, we used LoRA for OctoCoder.
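
For reference, a minimal sketch of what that setup looks like with peft - the hyperparameters and target modules below are illustrative values for StarCoder-style models, not the exact OctoCoder launch configuration:

```python
# Minimal sketch of the LoRA setup with peft, assuming the bigcode/guanaco-commits
# dataset and illustrative hyperparameters (NOT the exact OctoCoder launch config).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("bigcode/guanaco-commits", split="train")  # split name assumed

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")

lora_config = LoraConfig(
    r=16,               # adapter rank (illustrative)
    lora_alpha=32,      # scaling factor (illustrative)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],  # attention projections in StarCoder (GPTBigCode)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then run the usual training loop (e.g. finetuning/starcoder/finetune.py) on `dataset`.
```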

cc @ArmelRandy

@awasthiabhijeet

Hi @Muennighoff, @ArmelRandy:
Did you find full fine-tuning of StarCoder to be better than LoRA/PEFT?
(I'm a bit confused, since the paper doesn't mention the use of LoRA/PEFT techniques.)

@Muennighoff (Collaborator)

We did not find a significant difference between LoRA and full fine-tuning, so we used LoRA for all experiments.

Sorry about that. I have added the above as a note in Appendix M (Hyperparameters). We will update the arXiv version in a few months.

@awasthiabhijeet

Hi @Muennighoff ,

I think this is the exact dataset we used for OctoCoder: https://huggingface.co/datasets/bigcode/guanaco-commits

The above dataset contains 13K samples. However, from the paper it seems ~23K samples were used for training OctoCoder.


Am I missing something?

@Muennighoff (Collaborator)

For OctoCoder, we use OASST + CommitPackFT, so 8,587 + 5,000 ≈ 13,000 samples.
The other datasets are only used in the ablations.
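
To make the mix concrete, a rough sketch with the datasets library; the seed, config names, and equal per-language split are hypothetical, since the exact sampling seed isn't published (bigcode/guanaco-commits above is the ready-made result):

```python
# Rough sketch of drawing ~5,000 CommitPackFT samples across the 6 evaluated languages.
# The seed, config names, and equal per-language split are all hypothetical; the exact
# seed is not published, and bigcode/guanaco-commits is the ready-made combined dataset.
from datasets import load_dataset, concatenate_datasets

LANGS = ["python", "javascript", "java", "go", "c++", "rust"]  # config names assumed
PER_LANG = 5000 // len(LANGS)  # ~833 per language -> ~5,000 total

parts = []
for lang in LANGS:
    ds = load_dataset("bigcode/commitpackft", lang, split="train")
    parts.append(ds.shuffle(seed=0).select(range(PER_LANG)))  # hypothetical seed

commitpackft_5k = concatenate_datasets(parts)
print(len(commitpackft_5k))
# Concatenating the ~8.6K filtered OASST samples on top gives the ~13K total above.
```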

@awasthiabhijeet

For OctoCoder, we use OASST + CommitPackFT, so 8,587 + 5,000 ≈ 13,000 samples. The other datasets are only used in the ablations.

Thanks! :)

@mstallone (Author)

Great! Appreciate the response.

Could you also clarify the environments used in evaluation? We are seeing discrepancies of up to 10% on OctoCoder between the paper's numbers and our eval results. Perhaps you could specify the build versions of the languages? I see the code just specifies the latest stable Rust, for example.

@Muennighoff (Collaborator)

Sure, these are the evaluation versions:

Python: Python 3.9.13 torch 1.13.0+rocm5.2 accelerate 0.20.3 transformers 4.32.1
C++: 11.4.0 (but newer ones should be fine too)
JS: js-md5@0.7.3
Java: java version "18" 2022-03-22
Go: go1.18.4
Rust: rustc 1.71.1 (eb26296b5 2023-08-03)
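
If it helps with pinning the environment, a small sketch that prints the Python-side versions for comparison against the list above (the language toolchains still need to be checked separately, e.g. rustc --version, go version):

```python
# Print the Python-side evaluation environment for comparison against the list above.
import sys
from importlib.metadata import version

print("Python      :", sys.version.split()[0])      # expect 3.9.13
for pkg in ("torch", "accelerate", "transformers"):
    print(f"{pkg:<12}:", version(pkg))               # expect 1.13.0+rocm5.2 / 0.20.3 / 4.32.1
```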

Also, HumanEval performance is noisy, as there are only 164 problems per task per subset. You may find that a different seed or a checkpoint from a different step makes up for that 10% relative difference on Python HumanEvalSynthesize.
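
To put a rough number on that noise: with 164 problems, a simple binomial standard error on pass@1 is already around four points (the 0.46 score below is just an example value):

```python
# Back-of-the-envelope noise estimate for pass@1 on HumanEval's 164 problems,
# treating each problem as an independent Bernoulli trial. 0.46 is an example score.
import math

n_problems = 164
p = 0.46                                  # example pass@1
stderr = math.sqrt(p * (1 - p) / n_problems)
print(f"standard error: {stderr:.3f}")    # ~0.039, i.e. roughly +/- 4 points at one sigma
```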

Other people have been able to reproduce the results; someone even got 46.5 pass@1 on Python just by re-evaluating OctoCoder with our script, which is better than in our paper - probably due to different versions or batch-size settings.

@JunHyungKang commented Sep 21, 2023

We did not find a significant difference between LoRA and full fine-tuning, so we used LoRA for all experiments.

Sorry about that. I have added the above as a note in Appendix M (Hyperparameters). We will update the arXiv version in a few months.

@Muennighoff The Appendix says that "OCTOCODER was trained for 35 steps with a sequence length of 2048".
In my opinion, with a sequence length of 2048 and only 35 steps, it seems the entire dataset won't be fully covered (roughly 2048 × 35 ≈ 72,000 tokens only?). Am I understanding this correctly?

@Muennighoff (Collaborator) commented Sep 21, 2023

Note that it's about 2.2 million total finetuning tokens due to the batch size of 32. The steps and sequence length are correct - you usually do not need many tokens for instruction tuning; see e.g. the graph below from prior work.
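
Spelled out as a quick sanity check, using only the numbers quoted in this thread:

```python
# Sanity-check the training-budget numbers quoted in this thread.
steps, seq_len, batch_size = 35, 2048, 32

total_tokens = steps * seq_len * batch_size   # 2,293,760 if every sequence is packed to 2048 tokens
samples_seen = steps * batch_size             # 1,120 of the ~13K-sample mix

print(f"total finetuning tokens: {total_tokens:,}")
print(f"samples seen:            {samples_seen:,}")
```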

[Graph from prior work showing performance as a function of instruction-tuning tokens for BLOOMZ and mT0]

@SeanHeelan

@Muennighoff How did you decide on the number of samples from CommitPackFT to use during fine-tuning? I.e., where did the 5k number come from? Your graph above seems to indicate increased performance for the BLOOMZ model well into the hundreds of millions of fine-tuning tokens, and I've seen other fine-tunes of Llama-2 using training sets that vary from ~5k all the way up to ~80k samples for similar-ish tasks. I am curious what insights/experiences led you to 5k.

@Muennighoff (Collaborator)

The 5K was mostly arbitrary. Our filtered OASST dataset had around 5K samples, so we just decided to fix it at 5K for CommitPackFT, too. You can probably use more.

You are right that performance improves into the hundreds of millions of tokens for BLOOMZ; mT0 seems to saturate earlier. It could be that fine-tuning OctoCoder for longer would lead to better performance.
