Replies: 2 comments 2 replies
-
Hi @stefan-it, and thanks for your feedback! Are you running your evaluations using vLLM? In other words, when you evaluate, you don't get a logging message saying something akin to "Evaluation failed with vLLM - trying Hugging Face instead"? That's at least one aspect that can cause variation in the results. If not, I'll look into getting the models re-evaluated so that the leaderboard scores are properly comparable 🙂
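Roughly, the backend selection behaves like the sketch below - the helper functions are illustrative stand-ins, not the benchmark's actual code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("benchmark")


def evaluate_with_vllm(model_id: str) -> dict:
    # Illustrative stand-in: a real implementation would run the vLLM backend.
    raise RuntimeError("vLLM backend unavailable")


def evaluate_with_hf(model_id: str) -> dict:
    # Illustrative stand-in: a real implementation would run the Hugging Face backend.
    return {"model": model_id, "backend": "huggingface"}


def evaluate(model_id: str) -> dict:
    """Try the vLLM backend first; fall back to Hugging Face on failure."""
    try:
        return evaluate_with_vllm(model_id)
    except Exception:
        logger.warning("Evaluation failed with vLLM - trying Hugging Face instead")
        return evaluate_with_hf(model_id)


print(evaluate("dbmdz/bert-base-german-cased"))
```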
-
This should be fixed now that the evaluations have been re-run. If you feel something is still missing, feel free to re-open! 🙂
-
Hi everyone!
Many thanks for releasing this great evaluation benchmark - it helps a lot in my research and development of language models.
I have one question regarding the reported performance scores on the German NLU datasets, mainly GermEval.
I am using the latest main version, and here are my performance comparisons with the current leaderboard, which show some discrepancies for the following models:
- dbmdz/bert-base-german-cased
- deepset/gbert-base
- gwlms/deberta-base-dewiki-v1
My assumption is that an old version of the GermEval dataset may have been used to measure performance, and that those results are now mixed with results from a more recent version of the dataset.
But this is just an assumption, as I could not reproduce the results for dbmdz/bert-base-german-cased and deepset/gbert-base - the performance difference is very high!
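One quick way to test the dataset-version hypothesis would be to inspect the test split locally and compare its size and label set against what the benchmark uses. A minimal sketch, assuming the GermEval 2014 NER data is loaded from the Hugging Face Hub (the dataset id is my assumption; the benchmark may load the data differently):

```python
from datasets import load_dataset

# Assumption: GermEval 2014 NER from the Hugging Face Hub; the benchmark may
# use a different dataset id, revision, or preprocessing.
ds = load_dataset("germeval_14", split="test")

# The number of examples and the label set should be stable across versions;
# a difference here would point at a dataset change rather than a modelling issue.
print(len(ds))
print(ds.features)
```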