
What is the training data format of CommitPackFT and OASST when finetuning CodeGeeX2? #9

sxthunder opened this issue Aug 18, 2023 · 6 comments


@sxthunder

In your paper, CommitPack is trained with the following format:
Question: <commit_before>xxx<commit_msg>
Answer: <commit_after>xxx

But in CodeGeeX2's vocabulary, no special tokens like <commit_before> or <commit_msg> were added. I downloaded the OctoGeeX checkpoint and ran predictions in this format, but the answers are wrong.

Could you explain in more detail how you convert CommitPackFT and OASST into the fine-tuning data format?
(What is the input and what is the output?)

Thanks

@Muennighoff
Collaborator

We format samples into a simple Q & A format for OctoCoder & OctoGeeX:

For CommitPackFT:

Question: {subject}
{old_contents}

Answer: 
{new_contents}

For OASST:

Question: {input}

Answer: 
{output}

So we do not rely on any special tokens. We only use those special tokens for pretraining / fine-tuning on StarCoder & SantaCoder in the appendix. Let me know if something is unclear!
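For illustration, here is a minimal sketch of how a single record could be turned into that Q&A prompt. The field names ({subject}, {old_contents}, {new_contents}, {input}, {output}) follow the templates above; this is not the exact training code.

```python
# Minimal sketch of the Q&A formatting described above.
# Field names follow the templates: {subject}/{old_contents}/{new_contents}
# for CommitPackFT and {input}/{output} for OASST.
# This is an illustration, not the exact training code.

def format_commitpackft(sample: dict) -> str:
    prompt = f"Question: {sample['subject']}\n{sample['old_contents']}\n\nAnswer:\n"
    return prompt + sample["new_contents"]

def format_oasst(sample: dict) -> str:
    prompt = f"Question: {sample['input']}\n\nAnswer:\n"
    return prompt + sample["output"]
```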

@sxthunder
Author


Thank you!

@sxthunder
Author

I have two other questions:

  1. In your script ./finetuning/starcoder/finetune.py, I noticed that training samples are directly concatenated without padding, as in the pretraining stage. This is different from many fine-tuning scripts. Is this just to speed up training?
  2. Instruction tuning a pretrained code model lets it understand human instructions and improves its scores on many benchmarks. But for code completion in an IDE environment (like Copilot or CodeGeeX), which kind of model is more suitable: pretrained or instruction-tuned?

@Muennighoff
Collaborator

  1. Yes, this is called packing; it makes training more efficient (see the sketch below).
  2. For code completion in your IDE, where you just want suggestions that directly continue your code, a pretrained model is likely more suitable. I.e. I would recommend StarCoder, not OctoCoder, in that case. However, if you want a model to do something specific for you, such as "Write a function to do bubble sort", I'd recommend OctoCoder. You might be able to get StarCoder to do it via comments, but then it might just end up writing # pass in the code or fail in other ways. Further, if you want to edit code or explain code, I'd also recommend OctoCoder.
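For reference, a minimal sketch of what packing means here, assuming a generic Hugging Face style tokenizer; this is not the actual finetune.py code:

```python
# Illustrative packing: tokenized samples are concatenated (separated by EOS)
# and the stream is cut into fixed-length chunks, so no padding is needed.
# `tokenizer` is assumed to expose __call__ and eos_token_id; this is not
# the actual finetune.py implementation.

def pack_samples(tokenizer, texts, seq_length=2048):
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)
    # drop the trailing remainder that does not fill a full sequence
    return [
        buffer[i : i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]
```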

@sxthunder
Author

Sorry to bother you again:

  1. The README says that "OctoGeeX is finetuned based on [CodeGeeX2-6B](https://huggingface.co/THUDM/codegeex2-6b) using an internal training framework." Is there any plan to open-source this part? Can finetuning/starcoder/finetune.py train the same model?
  2. OctoGeeX's training hyperparameters show that it is only trained for 50 steps, but CommitPackFT has nearly 0.7M samples. Is this a mistake?

@Muennighoff
Collaborator


Any questions are very welcome!

  1. Unfortunately, we cannot open-source that framework; however, finetuning/starcoder/finetune.py should be able to train the same model.
  2. Yes, we found that performance plateaus after a few steps, so we only use a subset of CommitPackFT (for OctoGeeX, the exact dataset used for fine-tuning is uploaded here: https://huggingface.co/datasets/bigcode/co-manual; see the loading example below).
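For reference, the linked dataset can be loaded with the datasets library; the "train" split name below is an assumption:

```python
from datasets import load_dataset

# Load the OctoGeeX fine-tuning data linked above; the "train" split name
# is an assumption.
ds = load_dataset("bigcode/co-manual", split="train")
print(ds[0])  # inspect one sample
```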
