Releases · ScandEval/ScandEval
v12.10.4
v12.10.3
Fixed
- Access to the evaluation datasets was shut down by Hugging Face. It has now been restored.
v12.10.2
Fixed
- Correctly update logits processors and prefix-allowed-tokens functions for NER datasets when starting generation.
- We now use logprobs for OpenAI models, as this is now supported by the chat models. This is used for all sequence classification based tasks, which currently comprise sentiment classification, linguistic acceptability, knowledge and common-sense reasoning. This fixes some incorrect evaluations of the newer GPT-4-turbo and GPT-4o models, as they tend to output things like "Sentiment: positive" rather than simply "positive".
v12.10.1
Fixed
- Now recognises the metadata for the new GPT-4o models correctly. Currently there is a version clash between `vllm` and `tiktoken`, meaning that one needs to manually upgrade `tiktoken` to evaluate GPT-4o - an informative error message notes this to the user now in that case.
- The number of generated tokens for sequence classification tasks has been changed back to 1 (from 3). This makes no difference to open source models, as we only use the logprobs from the first token anyway, but it makes a big difference on multiple choice QA tasks for OpenAI models, as some of them might output things like "a is correct" rather than simply "a". Since we're using word edit distance to the labels, this might accidentally cause the final prediction to be different from "a".
- An error in `outlines<=0.0.36` meant that NER evaluations were near-random. Unfortunately, due to a strict `outlines` requirement in `vllm`, we cannot enforce `outlines>=0.0.37` (see this vLLM PR for a future fix). For now, to prevent faulty evaluations, we raise an error asking the user to manually upgrade `outlines` if they have an old version.
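
To make the word-edit-distance issue concrete, here is a self-contained sketch (hypothetical helper functions, not ScandEval's code; the label set is a placeholder) that snaps generated text to the closest candidate label:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        curr = [i]
        for j, xb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (xa != xb)))   # substitution
        prev = curr
    return prev[-1]

def closest_label(generated: str, labels: list[str]) -> str:
    # Compare word sequences: "a is correct" is at distance 2 from "a"
    # (two extra words), so extra output can tip the prediction toward
    # a wrong label - hence generating only a single token is safer.
    words = generated.lower().split()
    return min(labels, key=lambda lab: edit_distance(words, lab.split()))

print(closest_label("a is correct", ["a", "b", "c", "d"]))
```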
v12.10.0
Changed
- Update `autoawq` to `>=0.2.5,<0.3.0`, as it no longer has a dependency clash with `transformers`.
- Update `vllm` to `>=0.4.2,<0.5.0`, to support new models (such as Phi-3).
- Update `torch` to `>=2.3.0,<3.0.0`, as this is required by `vllm`.
Fixed
- When overriding benchmark configuration parameters in `Benchmarker.benchmark`, these overridden parameters are now correctly used when building datasets (see the sketch after this list).
- When a generative model was benchmarked on a NER task followed by another task, the structured generation wasn't set up correctly, as we're not re-initialising the model since v12.8.0. We now ensure that the logits processors are re-built for every dataset.
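
For context, overriding a configuration parameter directly in the benchmark call looks roughly like this (a hypothetical usage sketch; the exact parameter names and defaults may differ from ScandEval's API, and the model and dataset names are just examples):

```python
from scandeval import Benchmarker

# Configure defaults once when constructing the benchmarker...
benchmarker = Benchmarker(progress_bar=False)

# ...then pass configuration overrides in the benchmark call itself.
# These overrides are now also respected when the datasets are built.
benchmarker.benchmark(
    model="mistralai/Mistral-7B-v0.1",
    dataset="angry-tweets",
)
```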
v12.9.1
Fixed
- Disables the prefix caching of vLLM, as it has not been implemented with sliding window attention yet, causing re-initialisation errors.
- Updates `vllm` to `>=0.4.1,<0.5.0`, as this fixes an issue with benchmarking freezing.
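
Disabling prefix caching amounts to roughly the following when constructing the vLLM engine (a minimal sketch with a placeholder model; not ScandEval's actual call site):

```python
from vllm import LLM

# enable_prefix_caching=False avoids the re-initialisation errors seen
# with models using sliding window attention, which vLLM's prefix
# caching does not support yet.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model id
    enable_prefix_caching=False,
)
```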
v12.9.0
Changed
- Update `optimum` dependency to `>=1.19.1,<2.0.0`, as it is now compatible with `transformers>=4.40.0,<4.41.0`.
Fixed
- Pin `vllm` to `v0.4.0`, since `v0.4.1` has breaking changes and is causing issues with flash attention.
- Catch the vLLM error when prefix caching is set for models with sliding window attention, as this is not supported yet in vLLM.
v12.8.0
Changed
- Updated `vllm` to `>=0.4.0,<0.5.0`, which both fixes an issue with multi-GPU benchmarking and supports more models.
- Updated `transformers` to `>=4.40.0,<4.41.0`, to support more models.
- Removed the `olmo` extra, as it is now included in `transformers`.
- Downgraded `outlines` to `v0.0.34`, as any newer version is currently incompatible with `vllm`. This will be changed back to newer versions when this vLLM PR has been merged and released.
Fixed
- Now does not reload generative models between each evaluation. This both saves some evaluation time and prevents a bug when using multiple GPUs.
- Handle the change from having `float` logprobs in vLLM to the new `Logprob` objects (see the sketch below).
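
A compatibility shim for that change might look like the following (a hedged sketch; ScandEval's actual handling may differ). Newer vLLM versions wrap each value in a `Logprob` object exposing a `.logprob` attribute, where older versions returned plain floats:

```python
from typing import Union

def to_float(lp: Union[float, "Logprob"]) -> float:
    """Normalise a vLLM logprob entry to a plain float.

    Older vLLM versions stored raw floats; newer ones wrap the value
    in a Logprob object with a `.logprob` attribute.
    """
    return lp.logprob if hasattr(lp, "logprob") else float(lp)
```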
v12.7.0
Added
- Added a script to evaluate human performance on datasets. This is a Gradio app which can be run using the command `human_evaluate --annotator-id <id>`, where `annotator-id` is the ID of the human annotator (from 0 to 10, inclusive). They will then annotate their answers for validation splits from the iteration corresponding to their annotator ID. All of the annotated results will be stored to `scandeval_benchmark_results.jsonl`, as usual - note here that this will create a single `human` entry, where multiple annotators will count as multiple iterations for the same `human` model.
Fixed
- If a model has a very small maximal context length in its tokeniser configuration, then we ignore this value and instead use the default value.
- When a model is generative, we use a default context length of 32,768.
- Now ensures that we use mixed precision when CUDA is available, as this is required by Flash Attention.
- By default we only use flash attention for generative models, as it leads to errors with several encoder models.
- Add missing OpenAI models to the model cache, to allow checking model existence when no OpenAI key is specified.
- Only imports from the `openai` package if it has been installed.
- Improved detection of the end-of-chat tokens for instruction tuned models, which previously caused errors when evaluating some instruction tuned models.
- Loading of a pretrained model configuration from the Hugging Face Hub failed when the model is gated and the `cache_dir` is specified in `AutoConfig.from_pretrained`. We now do not set that argument if the model is gated, as a temporary fix (see the sketch below).
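
The temporary fix in the last item amounts to roughly the following (a minimal sketch assuming a hypothetical `model_is_gated` flag; not ScandEval's exact code):

```python
from transformers import AutoConfig

def load_config(model_id: str, cache_dir: str, model_is_gated: bool):
    # Passing cache_dir for gated models triggered the loading failure,
    # so we only set it for non-gated models as a temporary workaround.
    kwargs = {} if model_is_gated else {"cache_dir": cache_dir}
    return AutoConfig.from_pretrained(model_id, **kwargs)
```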
v12.6.1
Fixed
- Changed vLLM inference parameters to limit the GPU memory usage during evaluation, which makes it possible to evaluate larger models on the same hardware as previously. Concretely, the `gpu_memory_utilization` has been raised from 0.9 to 0.95, `enforce_eager` is set to True, and the `max_model_len` has been reduced from (at most) 10,000 to (at most) 5,000. See this issue for an overview of the maximum number of tokens in each dataset (as of v12.6.0 of ScandEval), and see the sketch after this list for the resulting parameters.
- Removed 1 sample from the Swedish sentiment classification dataset SweReC which was abnormally long, to keep the maximum number of tokens in the samples below 5,000. Replaced the outlier sample with a new one.
- The number of allowed generated tokens for the Danish summarisation dataset Nordjylland News was mistakenly set to 128, compared to 256 for all other summarisation datasets. This has been fixed now.
- Now correctly detects whether `autoawq` should be installed when evaluating an AWQ model.
- Reduced the `transformers` dependency to `4.38.x` again, as `autoawq` requires this.
- Do not use BitsAndBytes quantisation if the model is already quantised.