Releases: ScandEval/ScandEval

v12.10.4

03 Jun 13:02
4822f25

Fixed

  • Access to the evaluation datasets was shut down by Hugging Face again. It has now
    been restored.

v12.10.3

03 Jun 11:35
b5cf7d6

Fixed

  • Access to the evaluation datasets was shut down by Hugging Face. It has now been restored.

v12.10.2

30 May 13:53
2b10959

Fixed

  • Correctly update the logits processors and prefix-allowed-tokens functions for NER
    datasets when starting generation.
  • We now use logprobs for OpenAI models, as the chat models now support them. These
    are used for all sequence classification based tasks, which currently comprise
    sentiment classification, linguistic acceptability, knowledge and common-sense
    reasoning. This fixes some incorrect evaluations of the newer GPT-4-turbo and GPT-4o
    models, as they tend to output things like "Sentiment: positive" rather than simply
    "positive" (see the sketch below).

v12.10.1

28 May 10:24
b06882f

Fixed

  • Now recognises the metadata for the new GPT-4o models correctly. There is currently a version clash between vllm and tiktoken, meaning that one needs to manually upgrade tiktoken to evaluate GPT-4o; an informative error message now notifies the user in that case.
  • The number of generated tokens for sequence classification tasks has been changed back to 1 (from 3). This makes no difference for open source models, as we only use the logprobs from the first token anyway, but it makes a big difference on multiple choice QA tasks for OpenAI models, as some of them might output things like "a is correct" rather than simply "a". Since we're using word edit distance to the labels, this might accidentally cause the final prediction to differ from "a" (see the sketch after this list).
  • An error in outlines<=0.0.36 meant that NER evaluations were near-random. Unfortunately, due to a strict outlines requirement in vllm, we cannot enforce outlines>=0.0.37 (see this vLLM PR for a future fix). For now, to prevent faulty evaluations, we raise an error, asking the user to manually upgrade outlines if they have an old version.
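
A minimal sketch, not ScandEval's actual implementation, of the word-edit-distance
mapping mentioned above: the raw generation is mapped to the candidate label with the
smallest word-level edit distance, which is why limiting generation to a single token
keeps the mapping reliable.

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance computed over words instead of characters."""
    dist = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        prev, dist[0] = dist[0], i
        for j, word_b in enumerate(b, start=1):
            cur = min(
                dist[j] + 1,                # delete word_a
                dist[j - 1] + 1,            # insert word_b
                prev + (word_a != word_b),  # substitute
            )
            prev, dist[j] = dist[j], cur
    return dist[-1]


def closest_label(generation: str, labels: list[str]) -> str:
    """Map a raw generation to the label with the smallest word edit distance."""
    words = generation.lower().split()
    return min(labels, key=lambda label: word_edit_distance(words, label.lower().split()))


# With a single generated token the model can only answer e.g. "a"; with three tokens
# it may answer "a is correct", which is further from every label and makes the
# mapping less robust.
print(closest_label("a", ["a", "b", "c", "d"]))
print(closest_label("a is correct", ["a", "b", "c", "d"]))
```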

v12.10.0

08 May 11:38
c1932e6

Changed

  • Update autoawq to >=0.2.5,<0.3.0, as it no longer has a dependency clash with
    transformers.
  • Update vllm to >=0.4.2,<0.5.0, to support new models (such as Phi-3).
  • Update torch to >=2.3.0,<3.0.0, as this is required by vllm.

Fixed

  • When overriding benchmark configuration parameters in Benchmarker.benchmark, the
    overridden parameters are now correctly used when building datasets.
  • When a generative model was benchmarked on a NER task followed by another task, the
    structured generation wasn't set up correctly, as we have not been re-initialising
    the model since v12.8.0. We now ensure that the logits processors are re-built for
    every dataset (see the sketch below).
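
A minimal sketch of the fix, assuming vLLM's SamplingParams API; the model name,
datasets and allowed-token sets are illustrative stand-ins for ScandEval's actual
structured-generation setup.

```python
import torch
from vllm import LLM, SamplingParams


def build_logits_processor(allowed_token_ids: list[int]):
    """Return a logits processor that masks every token outside the allowed set."""
    def processor(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_token_ids] = logits[allowed_token_ids]
        return mask
    return processor


llm = LLM(model="mistralai/Mistral-7B-v0.1")  # loaded once, reused across datasets

datasets = [
    {"prompts": ["..."], "allowed_token_ids": [0, 1, 2]},  # illustrative
    {"prompts": ["..."], "allowed_token_ids": [3, 4, 5]},
]

for dataset in datasets:
    # Re-build the processor from scratch for every dataset, so constraints from the
    # previous dataset never leak into the next one.
    sampling_params = SamplingParams(
        max_tokens=32,
        logits_processors=[build_logits_processor(dataset["allowed_token_ids"])],
    )
    outputs = llm.generate(dataset["prompts"], sampling_params)
```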

v12.9.1

30 Apr 10:44
ad77ef1

Fixed

  • Disables vLLM's prefix caching, as it has not yet been implemented for sliding
    window attention, causing re-initialisation errors (see the sketch after this list).
  • Updates vllm to >=0.4.1,<0.5.0, as this fixes an issue with benchmarking
    freezing.
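
A minimal sketch of the change, assuming vLLM's LLM constructor accepts the
enable_prefix_caching flag; the model name is illustrative.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # a model using sliding window attention
    enable_prefix_caching=False,        # prefix caching + sliding window attention is unsupported
)
```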

v12.9.0

26 Apr 14:47
bbb4b3e

Changed

  • Update optimum dependency to >=1.19.1,<2.0.0, as it is now compatible with
    transformers>=4.40.0,<4.41.0.

Fixed

  • Pin vllm to v0.4.0, since v0.4.1 has breaking changes and is causing issues
    with flash attention.
  • Catch vLLM error when prefix caching is set for models with sliding window attention,
    as this is not supported yet in vLLM.

v12.8.0

23 Apr 14:15
9dea4e7

Changed

  • Updated vllm to >=0.4.0,<0.5.0, which both fixes an issue with multi-GPU
    benchmarking and supports more models.
  • Updated transformers to >=4.40.0,<4.41.0, to support more models.
  • Removed the olmo extra, as it is now included in transformers.
  • Downgraded outlines to v0.0.34 as any newer version is currently incompatible
    with vllm. This will be changed back to newer versions when this vLLM PR has been
    merged and released.

Fixed

  • Generative models are no longer reloaded between evaluations. This both saves some
    evaluation time and prevents a bug when using multiple GPUs.
  • Handle the change from having float logprobs in vLLM to the new Logprob objects.

v12.7.0

19 Apr 14:41
5d9c9e1

Added

  • Added a script to evaluate human performance on datasets. This is a Gradio app
    which can be run using the command human_evaluate --annotator-id <id>, where
    annotator-id is the ID of the human annotator (from 0 to 10, inclusive). The
    annotator then annotates the validation splits from the iteration corresponding to
    their annotator ID. All annotated results are stored in
    scandeval_benchmark_results.jsonl, as usual - note that this creates a single human
    entry, where multiple annotators count as multiple iterations for the same human
    model.

Fixed

  • If a model has a very small maximal context length in its tokeniser configuration,
    we ignore this value and use the default value instead.
  • When a model is generative, we use a default context length of 32,768 (see the
    sketch after this list).
  • Now ensures that we use mixed precision when CUDA is available, as this is required
    by Flash Attention.
  • By default we only use flash attention for generative models, as it leads to errors
    with several encoder models.
  • Added missing OpenAI models to the model cache, to allow checking model existence
    when no OpenAI key is specified.
  • Only imports from the openai package if it has been installed.
  • Improved detection of the end-of-chat tokens for instruction tuned models, which
    previously caused errors when evaluating some instruction tuned models.
  • Loading a pretrained model configuration from the Hugging Face Hub failed when the
    model is gated and cache_dir is specified in AutoConfig.from_pretrained. As a
    temporary fix, we no longer set that argument when the model is gated.
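
A minimal sketch of the context-length fallback described above, not ScandEval's exact
logic; the threshold for "very small" and the non-generative default are illustrative
assumptions, while the 32,768-token default for generative models follows the note above.

```python
from transformers import AutoTokenizer

GENERATIVE_DEFAULT = 32_768


def resolve_context_length(model_id: str, generative: bool) -> int:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    max_length = getattr(tokenizer, "model_max_length", None)
    # Ignore implausibly small (or placeholder) values in the tokeniser config.
    if max_length is None or max_length < 128 or max_length > 1_000_000:  # illustrative thresholds
        return GENERATIVE_DEFAULT if generative else 512  # 512 is an illustrative non-generative default
    return max_length
```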

v12.6.1

11 Apr 13:06
5b6132d

Fixed

  • Changed vLLM inference parameters to limit GPU memory usage during evaluation,
    which makes it possible to evaluate larger models on the same hardware as before.
    Concretely, gpu_memory_utilization has been raised from 0.9 to 0.95, enforce_eager
    is set to True, and max_model_len has been reduced from (at most) 10,000 to (at
    most) 5,000 (see the sketch after this list). See this issue for an overview of
    the maximum number of tokens in each dataset (as of v12.6.0 of ScandEval).
  • Removed one abnormally long sample from the Swedish sentiment classification
    dataset SweReC, to keep the maximum number of tokens in the samples below 5,000,
    and replaced it with a new one.
  • The number of allowed generated tokens for the Danish summarisation dataset
    Nordjylland News was mistakenly set to 128, compared to 256 for all other
    summarisation datasets. This has been fixed now.
  • Now correctly detects if autoawq should be installed, when evaluating an AWQ model.
  • Reduced transformers dependency to 4.38.x again, as autoawq requires this.
  • Do not use BitsAndBytes quantisation if the model is already quantised.
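
A minimal sketch of the vLLM settings described in the first item above; the model name
is illustrative and the exact call site in ScandEval may differ.

```python
from vllm import LLM

llm = LLM(
    model="AI-Sweden-Models/gpt-sw3-6.7b",  # illustrative model
    gpu_memory_utilization=0.95,  # raised from 0.9
    enforce_eager=True,           # skip CUDA graph capture to save GPU memory
    max_model_len=5_000,          # reduced from (at most) 10,000
)
```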