
🌳WildCodeBench

Warning

The project is under active development. Please check back later for more updates.

Warning

Please use WildCode with caution. Unlike EvalPlus, WildCode runs in a much less constrained execution environment to support tasks with diverse library dependencies, which may pose security risks. We recommend running the evaluation inside a sandbox such as Docker.

🌳 About • 🔥 Quick Start • 💻 LLM code • 🔍 Failure inspection • 🐞 Known issues • 📜 Citation • 🙏 Acknowledgement

About

WildCodeBench

WildCodeBench is a rigorous benchmark for code generation under realistic constraints in the wild. It aims to evaluate the true programming capabilities of large language models (LLMs) in a realistic setting. The benchmark consists of HumanEval-like, function-level code generation tasks, but with much more fine-grained task descriptions and diverse tool use.

WildCode

To facilitate the evaluation of LLMs on WildCodeBench, we provide a Python package, wild-code, that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the EvalPlus framework, a flexible and extensible evaluation framework for code generation tasks.

Why WildCode?

WildCode is a rigorous evaluation framework for LLM4Code, with:

  • ✨ Precise evaluation & ranking: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
  • ✨ Pre-generated samples: WildCode accelerates code intelligence research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!

Main Differences from EvalPlus

WildCode inherits the design of the EvalPlus framework, but differs from it in the following ways:

  • Execution Environment: The execution environment in WildCode is less constrained than that of EvalPlus, in order to support tasks with diverse library dependencies.
  • Test Evaluation: WildCode relies on unittest for evaluating the generated code, which is more suitable for the test harness in WildCodeBench.

🔥 Quick Start

Tip

WildCode ❤️ bigcode-evaluation-harness! WildCodeBench will be integrated into bigcode-evaluation-harness, so you will also be able to run it there!

To get started, please first set up the environment:

pip install wild-code --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/bigcode-project/wild-code.git" --upgrade
⏬ Using WildCode as a local repo? :: click to expand ::
git clone https://github.com/bigcode-project/wild-code.git
cd wild-code
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -e .
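
To sanity-check the installation, you can inspect the installed package metadata (this is a standard pip command, not a WildCode-specific one):

pip show wild-code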

Code Generation

We suggest using flash-attn for generating code samples.

pip install -U flash-attn

To generate code samples from a model, you can use the following command:

wildcode.generate \
    --model [model_name] \
    --dataset [wildcodebench] \
    --nl2code [False|True] \
    --greedy \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|hf|openai|mistral|anthropic|google] \
    --tp [gpu_number]

The generated code samples will be stored in a file named [model_name]--wildcodebench-[nl2c|c2c]--[backend]-[temp]-[n_samples].jsonl.
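
For example, a greedy NL2Code run on the vLLM backend with a single GPU could look like the following (the model name is only an illustration, not a recommendation):

# The model name below is only an illustration
wildcode.generate \
    --model deepseek-ai/deepseek-coder-6.7b-instruct \
    --dataset wildcodebench \
    --nl2code True \
    --greedy \
    --n_samples 1 \
    --resume \
    --backend vllm \
    --tp 1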

🤔 Structure of `problem`? :: click to expand ::
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • instruction is the instruction for the task completion
  • canonical_solution is the ground-truth implementation
  • test is the unittest test case

Note

Expected Schema of [model_name]--wildcodebench-[task]--[backend]-[temp]-[n_samples].jsonl

  1. task_id: the task ID, which is a key of get_wildcodebench()
  2. solution (optional): Self-contained solution (usually including the prompt)
    • Example: {"task_id": "WildCodeBench/?", "solution": "def f():\n return 1"}
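
If you generate code with your own tooling, you can still evaluate it with WildCode by writing records in this schema yourself. A minimal shell sketch (the task ID and solution below are placeholders, not a real benchmark entry):

# The task ID and solution below are placeholders
printf '%s\n' '{"task_id": "WildCodeBench/0", "solution": "def f():\n return 1"}' > samples.jsonl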

Code Post-processing

LLM-generated text may not be directly compilable code, since it can include natural-language lines or incomplete extra code. We provide a tool named wildcode.sanitize to clean up the code:

# 💡 If you are storing code in jsonl:
wildcode.sanitize --samples samples.jsonl
# Sanitized code will be written to `samples-sanitized.jsonl`

# 💡 If you are storing code in directories:
wildcode.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be written to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
🔎 Checking the compilability of post-processed code :: click to expand ::

To double-check the post-processing results, you can use wildcode.syncheck to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:

# 💡 If you are storing code in jsonl:
wildcode.syncheck --samples samples.jsonl --dataset [wildcodebench]

# 💡 If you are storing code in directories:
wildcode.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [wildcodebench]
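
Putting the two steps together, a typical post-processing pass over a JSONL file might look like this (file names follow the -sanitized convention described above):

# Sanitize, then verify the sanitized output
wildcode.sanitize --samples samples.jsonl
wildcode.syncheck --samples samples-sanitized.jsonl --dataset wildcodebench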

Code Evaluation

We strongly recommend using a sandbox such as Docker:

# mount the current directory to the container
docker run -v $(pwd):/wildcode terryzho/wildcode:latest --dataset wildcodebench --samples samples.jsonl

...Or if you want to try it locally regardless of the risks ⚠️:

First, install the dependencies for WildCodeBench:

pip install -r https://raw.githubusercontent.com/bigcode-project/wildcodebench-annotation/main/requirements.txt

Then, run the evaluation:

wildcode.evaluate --dataset [wildcodebench] --samples samples.jsonl

Tip

Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM etc.). Specifically, we set the dynamic timeout based on the ground-truth solution's runtime.

Additionally, avoid over-stressing your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or running other heavy workloads during the evaluation, is a bad idea.

⌨️ More command-line flags :: click to expand ::
  • --parallel: the number of parallel evaluation processes; defaults to half of the available cores
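
For example, on a machine with only a few cores you might cap the number of workers explicitly (the file name here is just illustrative):

# Cap parallelism on a small machine; the samples file name is illustrative
wildcode.evaluate --dataset wildcodebench --samples samples-sanitized.jsonl --parallel 8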

The output should look like the following (a GPT-4 greedy-decoding example):

Asserting the groundtruth...
Expected outputs computed in 1200.0 seconds
Reading samples...
1140it [00:00, 1901.64it/s]
Evaluating samples...
100%|████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
wildcodebench
{'pass@1': 0.568}
  • The "k" includes [1, 5, 10] where k values <= the sample size will be used
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
🤔 How long would it take? :: click to expand ::

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds. When running 1 sample x 964 tasks x all tests, it can take around ??-?? minutes with --parallel 64 and --test-details.

🔍 Failure Inspection

You can inspect the failed samples by using the following command:

wildcode.inspect --dataset [wildcodebench] --eval-results samples-sanitized_eval_results.json --in-place

Full script

We provide a sample script to run the full pipeline:

bash run.sh
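
If you prefer to assemble the pipeline yourself, it is roughly a chain of the commands shown above. A schematic sketch, assuming greedy decoding on the vLLM backend (the bracketed parts follow the file-naming pattern from the Code Generation section and must be filled in for your run):

# 1. Generate samples (greedy, vLLM backend)
wildcode.generate --model [model_name] --dataset wildcodebench --nl2code True --greedy --n_samples 1 --backend vllm --tp 1
# 2. Post-process the generated samples
wildcode.sanitize --samples [model_name]--wildcodebench-nl2c--vllm-[temp]-[n_samples].jsonl
# 3. Evaluate (preferably inside the Docker sandbox)
wildcode.evaluate --dataset wildcodebench --samples [model_name]--wildcodebench-nl2c--vllm-[temp]-[n_samples]-sanitized.jsonl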

💻 LLM-generated Code

We will share pre-generated code samples from the LLMs we have evaluated.

🐞 Known Issues

  • We notice that some tasks involving scientific modeling use a large amount of memory during testing, which can lead to timeout issues on some machines. If you get an error message like Check failed: ret == 0 (11 vs. 0) Thread creation via pthread_create() failed. in TensorFlow, it is very likely due to this memory issue. Try allocating more memory to the process or reducing the number of parallel processes.

  • Due to flakiness in the evaluation, the execution results may vary slightly (~0.5%) between runs. We are working on improving the stability of the evaluation.

📜 Citation

πŸ™ Acknowledgement