Warning
The project is under active development. Please check back later for more updates.
Warning
Please use WildCode with caution. Unlike EvalPlus, WildCode runs in a much less constrained execution environment in order to support tasks with diverse library dependencies, which may pose security risks. We recommend running the evaluation in a sandbox such as Docker.
About • Quick Start • LLM code • Failure inspection • Known issues • Citation • Acknowledgement
WildCodeBench is a rigorous benchmark for code generation with realistic constraints in the wild. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more fine-grained descriptions and diverse tool use.
To facilitate the evaluation of LLMs on WildCodeBench, we provide a Python package, wild-code, that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the EvalPlus framework, a flexible and extensible evaluation framework for code generation tasks.
WildCode is a rigorous evaluation framework for LLM4Code, with:
- ✨ Precise evaluation & ranking: see our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Pre-generated samples: WildCode accelerates code intelligence research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!
We inherit the design of the EvalPlus framework, but WildCode differs from it in the following ways:
- Execution Environment: the execution environment in WildCode is less constrained than in EvalPlus, in order to support tasks with diverse library dependencies.
- Test Evaluation: WildCode relies on unittest to evaluate the generated code, which better suits the test harness in WildCodeBench (see the sketch below).
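To make the unittest-based evaluation concrete, here is a minimal sketch of what a task-level test harness could look like. The task, the function name task_func, and the test body are illustrative assumptions, not an actual WildCodeBench task:

import unittest

# Hypothetical solution for an illustrative task (not from the dataset).
def task_func(numbers):
    """Return the mean of a list of numbers."""
    return sum(numbers) / len(numbers)

# A WildCodeBench-style harness: each task ships a unittest.TestCase
# that exercises the generated function.
class TestCases(unittest.TestCase):
    def test_mean(self):
        self.assertAlmostEqual(task_func([1, 2, 3, 4]), 2.5)

    def test_single_element(self):
        self.assertAlmostEqual(task_func([7]), 7.0)

if __name__ == "__main__":
    unittest.main()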
Tip
WildCode ❤️ bigcode-evaluation-harness! WildCodeBench will be integrated into bigcode-evaluation-harness, so you can also run it there!
To get started, please first set up the environment:
pip install wild-code --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/bigcode-project/wild-code.git" --upgrade
⏬ Using WildCode as a local repo? :: click to expand ::
git clone https://github.com/bigcode-project/wild-code.git
cd wild-code
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -e .
We suggest using flash-attn for generating code samples.
pip install -U flash-attn
To generate code samples from a model, you can use the following command:
wildcode.generate \
--model [model_name] \
--dataset [wildcodebench] \
--nl2code [False|True] \
--greedy \
--bs [bs] \
--temperature [temp] \
--n_samples [n_samples] \
--resume \
--backend [vllm|hf|openai|mistral|anthropic|google] \
--tp [gpu_number]
The generated code samples will be stored in a file named [model_name]--wildcodebench-[nl2c|c2c]--[backend]-[temp]-[n_samples].jsonl.
Structure of `problem`? :: click to expand ::
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `instruction` is the instruction for the task completion
- `canonical_solution` is the ground-truth implementation
- `test` is the unittest test case
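As a sketch of how these fields could be inspected programmatically (the import path wildcode.data is an assumption; get_wildcodebench() is referenced in the schema note below):

# Sketch: peek at one task. The import path is an assumption;
# check the package for the exact module.
from wildcode.data import get_wildcodebench

problems = get_wildcodebench()        # dict keyed by task_id
task_id, problem = next(iter(problems.items()))

print(task_id)                        # identifier string for the task
print(problem["entry_point"])         # name of the function to implement
print(problem["prompt"])              # function signature with docstring
print(problem["instruction"])         # instruction for the task completion
# problem["canonical_solution"] and problem["test"] hold the ground-truth
# implementation and the unittest harness, respectively.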
Note
Expected Schema of [model_name]--wildcodebench-[task]--[backend]-[temp]-[n_samples].jsonl
- `task_id`: Task ID, which are the keys of get_wildcodebench()
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: {"task_id": "WildCodeBench/?", "solution": "def f():\n return 1"}
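For instance, a valid samples file in this schema can be produced with just the standard library; the task_id and solution below are placeholders:

import json

# Sketch: dump solutions in the expected schema, one JSON object per line.
samples = [
    {"task_id": "WildCodeBench/0", "solution": "def f():\n    return 1"},
]

with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")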
LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete extra code.
We provide a tool named wildcode.sanitize to clean up the code:
# 💡 If you are storing codes in jsonl:
wildcode.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`
# 💡 If you are storing codes in directories:
wildcode.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
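To see why sanitization is needed at all: a raw model response often wraps the code in prose and markdown fences. The following is only a naive sketch of the idea, not the actual wildcode.sanitize implementation:

import re

def extract_code(text: str) -> str:
    """Naive sketch: pull the first fenced code block out of an LLM response.
    The real wildcode.sanitize tool is more robust than this."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1) if match else text  # fall back to the raw text

raw = "Sure! Here is the function:\n```python\ndef f():\n    return 1\n```"
print(extract_code(raw))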
Checking the compilability of post-processed code :: click to expand ::
To double-check the post-processing results, you can use wildcode.syncheck to check code validity before and after sanitization; it prints erroneous code snippets and explains why they are wrong:
# π‘ If you are storing codes in jsonl:
wildcode.syncheck --samples samples.jsonl --dataset [wildcodebench]
# π‘ If you are storing codes in directories:
wildcode.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [wildcodebench]
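At its simplest, such a validity check amounts to verifying that a candidate parses as Python. A sketch of the idea (not the actual wildcode.syncheck implementation):

import ast

def is_compilable(code: str) -> bool:
    """Sketch: report whether a solution at least parses as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_compilable("def f():\n    return 1"))   # True
print(is_compilable("def f(:\n    return 1"))    # False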
We strongly recommend using a sandbox such as Docker:
# mount the current directory to the container
docker run -v $(pwd):/wildcode terryzho/wildcode:latest --dataset wildcodebench --samples samples.jsonl
# ...Or locally ⚠️
wildcode.evaluate --dataset wildcodebench --samples samples.jsonl
...Or, if you want to try it locally regardless of the risks:
First, install the dependencies for WildCodeBench:
pip install -r https://raw.githubusercontent.com/bigcode-project/wildcodebench-annotation/main/requirements.txt
Then, run the evaluation:
wildcode.evaluate --dataset [wildcodebench] --samples samples.jsonl
Tip
Do you use a very slow machine?
LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set a dynamic timeout based on the ground-truth solution's runtime.
Additionally, you are NOT encouraged to overstress your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or doing other heavy work during evaluation, is a bad idea...
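For intuition, a dynamic timeout of this kind can be modeled as a floor plus a multiple of the ground-truth runtime; the constants below are illustrative assumptions, not WildCode's actual values:

def dynamic_timeout(gt_runtime_s: float, base_s: float = 4.0, factor: float = 4.0) -> float:
    # Sketch only: budget at least base_s seconds, scaled with the
    # ground-truth solution's measured runtime.
    return max(base_s, factor * gt_runtime_s)

print(dynamic_timeout(0.1))   # 4.0  -> fast tasks get the floor
print(dynamic_timeout(3.0))   # 12.0 -> slow tasks get proportionally more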
⌨️ More command-line flags :: click to expand ::
- `--parallel`: by default half of the cores
The output should look like the following (below is a GPT-4 greedy-decoding example):
Asserting the groundtruth...
Expected outputs computed in 1200.0 seconds
Reading samples...
1140it [00:00, 1901.64it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
wildcodebench
{'pass@1': 0.568}
- The "k" includes
[1, 5, 10]
where k values<=
the sample size will be used - A cache file named like
samples_eval_results.jsonl
will be cached. Remove it to re-run the evaluation
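For reference, pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a self-contained sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021):
    # n = samples per task, c = correct samples, k = evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))  # ~0.3: expected pass@1 for this task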
How long would it take? :: click to expand ::
If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds.
When running 1 sample x 964 tasks x all tests, it can take around ??-?? minutes by using --parallel 64 and --test-details.
Here are some tips to speed up the evaluation:
- Use --parallel $(nproc)
- Use our pre-evaluated results (see LLM-generated code)
You can inspect the failed samples by using the following command:
wildcode.inspect --dataset [wildcodebench] --eval-results sample-sanitized_eval_results.json --in-place
We provide a sample script to run the full pipeline:
bash run.sh
We will share pre-generated code samples from the LLMs we have evaluated.
- We notice that some tasks heavily use memory for scientific modeling during testing, which can lead to timeout issues on some machines. If you get an error message like Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed. in TensorFlow, it is very likely due to a memory issue. Try allocating more memory to the process or reducing the number of parallel processes.
- Due to flakiness in the evaluation, execution results may vary slightly (~0.5%) between runs. We are working on improving evaluation stability.