Performance of StarCoder on HumanEvalFixDocs #21

Open · awasthiabhijeet opened this issue Sep 6, 2023 · 9 comments

@awasthiabhijeet
With StarCoder, I am observing a pass@1 score of 58.9 instead of the 43.5 reported in the OctoCoder paper.

Script used:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16

Results:

{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
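For reference, the pass@k numbers in this output are the unbiased estimator from the Codex paper (Chen et al., 2021), averaged over problems with n = 20 samples each. A minimal sketch of that estimator (my own re-implementation, not the harness's exact code):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
    # where c of the n samples pass; computed stably as a product.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: a problem where 12 of 20 samples pass contributes
# pass@1 = 12/20 = 0.6 and pass@10 = 1.0 (fewer than 10 samples fail,
# so any 10 drawn samples must contain a passing one).
print(pass_at_k(20, 12, 1), pass_at_k(20, 12, 10))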

CC: @Muennighoff

@Muennighoff (Collaborator)

A few things are different in the command we ran: we use --precision bf16 instead of fp16, --max_length_generation 1800, and --batch_size 5. All of them can slightly affect the score, though I would be surprised if the difference were this large.
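As a quick illustration of the precision point (a toy sketch, assuming PyTorch; not part of the harness): bf16 and fp16 round the same fp32 weights differently, which can shift logits enough to flip near-tied tokens during sampling.

import torch

x = torch.randn(4, 4)  # stand-in for fp32 weights
# fp16 keeps more mantissa bits, bf16 keeps more exponent bits,
# so the two half-precision roundings of the same tensor differ.
diff = (x.to(torch.float16).float() - x.to(torch.bfloat16).float()).abs().max()
print(diff)  # small but typically nonzero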
You can verify the 43.5 we got here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json, and the generations here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/generations_humanevalfixdocspy_starcoder_temp02.json. If you want, you can directly compare those generations to yours to see where the discrepancies may be.
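If it helps, here is a rough sketch of that comparison, assuming both files are JSON lists with one list of generations per problem, and that huggingface_hub is installed:

import json
from huggingface_hub import hf_hub_download

# Fetch the reference generations from the bigcode/evaluation dataset repo.
ref_path = hf_hub_download(
    repo_id="bigcode/evaluation",
    repo_type="dataset",
    filename="starcoder/humanevalfixdocs/commit_format/generations_humanevalfixdocspy_starcoder_temp02.json",
)
ref = json.load(open(ref_path))
ours = json.load(open("generations_humanevalfixdocspython_starcodercommit_prompt.json"))  # your local file

# Flag problems where the first sample differs between the two runs.
for i, (r, o) in enumerate(zip(ref, ours)):
    if r[0] != o[0]:
        print(f"problem {i}: first generations differ")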

Overall, yes, the commit format on the pretrained StarCoder works really well. On the regular HumanEvalFix, StarCoder + commit format also outperforms OctoCoder; see the table below from Appendix G. The problem with the commit format is that it does not work well for code synthesis or explanation.

[Screenshot: Appendix G table comparing StarCoder with the commit format against OctoCoder on HumanEvalFix]

@awasthiabhijeet (Author) commented Sep 7, 2023

> On the regular HumanEvalFix, StarCoder + commit format also outperforms OctoCoder; see the table below from Appendix G.

This is helpful, thanks! I feel this deserves a mention in Table 2 itself then :)

Could you also share the script you used to obtain https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json? I can try re-running it with the exact same config that you used.

Thanks!

@Muennighoff (Collaborator)

Sure, it would be:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 1800 \
--precision bf16

@awasthiabhijeet (Author)

> Sure, it would be: [script quoted above]

With this script, I observe a pass@1 score of 60.1.

{
  "humanevalfixdocs-python": {
    "pass@1": 0.6009146341463415,
    "pass@10": 0.6974812593960444
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 5,
    "max_length_generation": 1800,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

@awasthiabhijeet (Author)

CC: @Muennighoff

@Muennighoff (Collaborator)

You're right; it seems the result in the paper is too low. I reran it and got the below:

{
  "humanevalfixdocs-python": {
    "pass@1": 0.5878048780487805,
    "pass@10": 0.6939082542089792
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 5,
    "max_length_generation": 1800,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_humanevalfixdocspython_starcoder_temp02_commit.json",
    "save_generations": true,
    "save_generations_path": "generations_humanevalfixdocspython_starcoder_temp02_commit.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
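(For what it's worth, the three pass@1 numbers in this thread agree to within about a point, which looks like ordinary sampling noise; a quick sanity check:)

# pass@1 from the three runs reported in this thread
runs = [0.589, 0.601, 0.588]  # fp16 run, bf16 run, this rerun
print(f"mean {sum(runs)/len(runs):.3f}, spread {max(runs)-min(runs):.3f}")
# mean 0.593, spread 0.013 -- all well above the 43.5 in the paper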

I will update the paper soon. Thanks a lot for noting this!

@awasthiabhijeet (Author)

Thanks @Muennighoff :)

@Muennighoff (Collaborator)

Attached is how the new section will look, including the updated results. Thanks again!

[Screenshot: updated paper section with the revised StarCoder HumanEvalFixDocs results]

@awasthiabhijeet (Author)

Thanks @Muennighoff, this is very helpful! :)
