Releases · ScandEval/ScandEval
v12.10.4
v12.10.3
Fixed
- Access to the evaluation datasets was shut down by Hugging Face. It has now been restored.
v12.10.2
Fixed
- Correctly update logits processors and prefix-allowed-tokens functions for NER datasets when starting generation.
- We now use logprobs for OpenAI models, as this is now supported by the chat models. This is used for all sequence classification based tasks, which currently comprise sentiment classification, linguistic acceptability, knowledge and common-sense reasoning. This fixes some incorrect evaluations of the newer GPT-4-turbo and GPT-4o models, as they tend to output things like "Sentiment: positive" rather than simply "positive".
v12.10.1
Fixed
- Now recognises the metadata for the new GPT-4o models correctly. Currently there is a version clash between `vllm` and `tiktoken`, meaning that one needs to manually upgrade `tiktoken` to evaluate GPT-4o - an informative error message notes this to the user now in that case.
- The number of generated tokens for sequence classification tasks has been changed back to 1 (from 3). This makes no difference to open source models, as we only use the logprobs from the first token anyway, but it makes a big difference on multiple choice QA tasks for OpenAI models, as some of them might output things like "a is correct" rather than simply "a". Since we're using word edit distance to the labels, this might accidentally cause the final prediction to be different from "a".
- An error in `outlines<=0.0.36` meant that NER evaluations were near-random. Unfortunately, due to a strict `outlines` requirement in `vllm`, we cannot enforce `outlines>=0.0.37` (see this vLLM PR for a future fix). For now, to prevent faulty evaluations, we raise an error asking the user to manually upgrade `outlines` if they have an old version.
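
To make the word-edit-distance issue concrete, here is a self-contained sketch (hypothetical helper functions, not ScandEval's code; the label set is a placeholder) that snaps generated text to the closest candidate label:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        curr = [i]
        for j, xb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (xa != xb)))   # substitution
        prev = curr
    return prev[-1]

def closest_label(generated: str, labels: list[str]) -> str:
    # Compare word sequences: "a is correct" is at distance 2 from "a"
    # (two extra words), so extra output can tip the prediction toward
    # a wrong label - hence generating only a single token is safer.
    words = generated.lower().split()
    return min(labels, key=lambda lab: edit_distance(words, lab.split()))

print(closest_label("a is correct", ["a", "b", "c", "d"]))
```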
v12.10.0
Changed
- Update `autoawq` to `>=0.2.5,<0.3.0`, as it no longer has a dependency clash with `transformers`.
- Update `vllm` to `>=0.4.2,<0.5.0`, to support new models (such as Phi-3).
- Update `torch` to `>=2.3.0,<3.0.0`, as this is required by `vllm`.
Fixed
- When overriding benchmark configuration parameters in `Benchmarker.benchmark`, these overridden parameters are now correctly used when building datasets (see the sketch after this list).
- When a generative model was benchmarked on a NER task followed by another task, the structured generation wasn't set up correctly, as we're not re-initialising the model since v12.8.0. We now ensure that the logits processors are re-built for every dataset.
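
For context, overriding a configuration parameter directly in the benchmark call looks roughly like this (a hypothetical usage sketch; the exact parameter names and defaults may differ from ScandEval's API, and the model and dataset names are just examples):

```python
from scandeval import Benchmarker

# Configure defaults once when constructing the benchmarker...
benchmarker = Benchmarker(progress_bar=False)

# ...then pass configuration overrides in the benchmark call itself.
# These overrides are now also respected when the datasets are built.
benchmarker.benchmark(
    model="mistralai/Mistral-7B-v0.1",
    dataset="angry-tweets",
)
```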
v12.9.1
Fixed
- Disables the prefix caching of vLLM, as it has not been implemented with sliding window attention yet, causing re-initialisation errors.
- Updates `vllm` to `>=0.4.1,<0.5.0`, as this fixes an issue with benchmarking freezing.
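
Disabling prefix caching amounts to roughly the following when constructing the vLLM engine (a minimal sketch with a placeholder model; not ScandEval's actual call site):

```python
from vllm import LLM

# enable_prefix_caching=False avoids the re-initialisation errors seen
# with models using sliding window attention, which vLLM's prefix
# caching does not support yet.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model id
    enable_prefix_caching=False,
)
```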
v12.9.0
Changed
- Update `optimum` dependency to `>=1.19.1,<2.0.0`, as it is now compatible with `transformers>=4.40.0,<4.41.0`.
Fixed
- Pin `vllm` to `v0.4.0`, since `v0.4.1` has breaking changes and is causing issues with flash attention.
- Catch the vLLM error when prefix caching is set for models with sliding window attention, as this is not supported yet in vLLM.
v12.8.0
Changed
- Updated `vllm` to `>=0.4.0,<0.5.0`, which both fixes an issue with multi-GPU benchmarking and supports more models.
- Updated `transformers` to `>=4.40.0,<4.41.0`, to support more models.
- Removed the `olmo` extra, as it is now included in `transformers`.
- Downgraded `outlines` to `v0.0.34`, as any newer version is currently incompatible with `vllm`. This will be changed back to newer versions when this vLLM PR has been merged and released.
Fixed
- Now does not reload generative models between each evaluation. This both saves some evaluation time and prevents a bug when using multiple GPUs.
- Handle the change from having `float` logprobs in vLLM to the new `Logprob` objects (see the sketch below).
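
A compatibility shim for that change might look like the following (a hedged sketch; ScandEval's actual handling may differ). Newer vLLM versions wrap each value in a `Logprob` object exposing a `.logprob` attribute, where older versions returned plain floats:

```python
from typing import Union

def to_float(lp: Union[float, "Logprob"]) -> float:
    """Normalise a vLLM logprob entry to a plain float.

    Older vLLM versions stored raw floats; newer ones wrap the value
    in a Logprob object with a `.logprob` attribute.
    """
    return lp.logprob if hasattr(lp, "logprob") else float(lp)
```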
v12.7.0
Added
- Added a script to evaluate human performance on datasets. This is a Gradio app which can be run using the command `human_evaluate --annotator-id <id>`, where `annotator-id` is the ID of the human annotator (from 0 to 10, inclusive). They will then annotate their answers for validation splits from the iteration corresponding to their annotator ID. All of the annotated results will be stored to `scandeval_benchmark_results.jsonl`, as usual - note here that this will create a single `human` entry, where multiple annotators will count as multiple iterations for the same `human` model.
Fixed
- If a model has a very small maximal context length in its tokeniser configuration, then we ignore this value and instead use the default value.
- When a model is generative, we use a default context length of 32,768.
- Now ensures that we use mixed precision when CUDA is available, as this is required by Flash Attention.
- By default we only use flash attention for generative models, as it leads to errors with several encoder models.
- Add missing OpenAI models to the model cache, to allow checking model existence when no OpenAI key is specified.
- Only imports from the `openai` package if it has been installed.
- Improved detection of the end-of-chat tokens for instruction tuned models, which previously caused errors when evaluating some instruction tuned models.
- Loading of a pretrained model configuration from the Hugging Face Hub failed when the model is gated and the `cache_dir` is specified in `AutoConfig.from_pretrained`. We now do not set that argument if the model is gated, as a temporary fix (see the sketch below).
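
The temporary fix in the last item amounts to roughly the following (a minimal sketch assuming a hypothetical `model_is_gated` flag; not ScandEval's exact code):

```python
from transformers import AutoConfig

def load_config(model_id: str, cache_dir: str, model_is_gated: bool):
    # Passing cache_dir for gated models triggered the loading failure,
    # so we only set it for non-gated models as a temporary workaround.
    kwargs = {} if model_is_gated else {"cache_dir": cache_dir}
    return AutoConfig.from_pretrained(model_id, **kwargs)
```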
v12.6.1
Fixed
- Changed vLLM inference parameters to limit the GPU memory usage during evaluation, which makes it possible to evaluate larger models on the same hardware as previously. Concretely, the `gpu_memory_utilization` has been raised from 0.9 to 0.95, `enforce_eager` is set to True, and the `max_model_len` has been reduced from (at most) 10,000 to (at most) 5,000. See this issue for an overview of the maximum number of tokens in each dataset (as of v12.6.0 of ScandEval), and see the sketch after this list for the resulting parameters.
- Removed 1 sample from the Swedish sentiment classification dataset SweReC which was abnormally long, to keep the maximum number of tokens in the samples below 5,000. Replaced the outlier sample with a new one.
- The number of allowed generated tokens for the Danish summarisation dataset Nordjylland News was mistakenly set to 128, compared to 256 for all other summarisation datasets. This has been fixed now.
- Now correctly detects whether `autoawq` should be installed when evaluating an AWQ model.
- Reduced the `transformers` dependency to `4.38.x` again, as `autoawq` requires this.
- Do not use BitsAndBytes quantisation if the model is already quantised.