
What is the training data format of CommitPackFT and OASST when finetuning CodeGeeX2? #9

sxthunder opened this issue Aug 18, 2023 · 6 comments


@sxthunder

In your paper, CommitPack is trained with the following format:
Question: <commit_before>xxx<commit_msg>
Answer: <commit_after>xxx

But in CodeGeeX2's vocabulary, no special tokens like <commit_before> or <commit_msg> were added. I downloaded the OctoGeeX checkpoint and ran predictions in this format, but the answers are wrong.

Could you explain in more detail how you convert CommitPackFT and OASST into the fine-tuning data format?
(What is the input and what is the output?)

Thanks

@Muennighoff
Collaborator

We format samples into a simple Q & A format for OctoCoder & OctoGeeX:

For CommitPackFT:

Question: {subject}
{old_contents}

Answer: 
{new_contents}

For OASST:

Question: {input}

Answer: 
{output}

So we do not rely on any special tokens. We only use those special tokens for pretraining / fine-tuning on StarCoder & SantaCoder in the appendix. Let me know if something is unclear!
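For illustration, here is a minimal sketch of how a single record could be turned into that Q&A prompt. The field names ({subject}, {old_contents}, {new_contents}, {input}, {output}) follow the templates above; this is not the exact training code.

```python
# Minimal sketch of the Q&A formatting described above.
# Field names follow the templates: {subject}/{old_contents}/{new_contents}
# for CommitPackFT and {input}/{output} for OASST.
# This is an illustration, not the exact training code.

def format_commitpackft(sample: dict) -> str:
    prompt = f"Question: {sample['subject']}\n{sample['old_contents']}\n\nAnswer:\n"
    return prompt + sample["new_contents"]

def format_oasst(sample: dict) -> str:
    prompt = f"Question: {sample['input']}\n\nAnswer:\n"
    return prompt + sample["output"]
```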

@sxthunder
Author


Thank you!

@sxthunder
Author

I have two other questions:

  1. In your script ./finetuning/starcoder/finetune.py, I noticed that training samples are directly concatenated without padding, as in the pretraining stage. This is different from many fine-tuning scripts. Is this just to speed up training?
  2. Instruction tuning a pretrained code model lets it understand human instructions and improves its scores on many benchmarks. But for code completion in an IDE environment (like Copilot or CodeGeeX), which kind of model is more suitable: pretrained or instruction-tuned?

@Muennighoff
Collaborator

  1. Yes, this is called packing; it makes training more efficient (see the sketch below).
  2. For code completion in your IDE, where you just want suggestions that directly continue your code, a pretrained model is likely more suitable. I.e. I would recommend StarCoder, not OctoCoder, in that case. However, if you want a model to do something specific for you, such as "Write a function to do bubble sort", I'd recommend OctoCoder. You might be able to get StarCoder to do it via comments, but then it might just end up writing # pass in the code or fail in other ways. Further, if you want to edit code or explain code, I'd also recommend OctoCoder.
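For reference, a minimal sketch of what packing means here, assuming a generic Hugging Face style tokenizer; this is not the actual finetune.py code:

```python
# Illustrative packing: tokenized samples are concatenated (separated by EOS)
# and the stream is cut into fixed-length chunks, so no padding is needed.
# `tokenizer` is assumed to expose __call__ and eos_token_id; this is not
# the actual finetune.py implementation.

def pack_samples(tokenizer, texts, seq_length=2048):
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)
    # drop the trailing remainder that does not fill a full sequence
    return [
        buffer[i : i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]
```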

@sxthunder
Author

Sorry to bother you again:

  1. The README says that "OctoGeeX is finetuned based on [CodeGeeX2-6B](https://huggingface.co/THUDM/codegeex2-6b) using an internal training framework." Is there any plan to open-source this part? Can finetuning/starcoder/finetune.py train the same model?
  2. OctoGeeX's training hyperparameters show that it is only trained for 50 steps, but CommitPackFT has nearly 0.7M samples. Is this a mistake?

@Muennighoff
Collaborator


Any questions are very welcome!

  1. Unfortunately, we cannot open-source that framework; however, finetuning/starcoder/finetune.py should be able to train the same model.
  2. Yes, we found that performance plateaus after a few steps, so we only use a subset of CommitPackFT (for OctoGeeX, the exact dataset used for fine-tuning is uploaded here: https://huggingface.co/datasets/bigcode/co-manual; see the loading example below).
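For reference, the linked dataset can be loaded with the datasets library; the "train" split name below is an assumption:

```python
from datasets import load_dataset

# Load the OctoGeeX fine-tuning data linked above; the "train" split name
# is an assumption.
ds = load_dataset("bigcode/co-manual", split="train")
print(ds[0])  # inspect one sample
```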
