Replies: 2 comments 2 replies
-
Hi @stefan-it, and thanks for your feedback! Are you running your evaluations using vLLM? In other words, when you evaluate, you don't get a logging message saying something akin to "Evaluation failed with vLLM - trying Hugging Face instead"? That's at least one aspect that can cause variation in the results. If not, I'll look into getting the models re-evaluated so that the leaderboard scores are properly comparable 🙂
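Roughly, the backend selection behaves like the sketch below - the helper functions are illustrative stand-ins, not the benchmark's actual code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("benchmark")


def evaluate_with_vllm(model_id: str) -> dict:
    # Illustrative stand-in: a real implementation would run the vLLM backend.
    raise RuntimeError("vLLM backend unavailable")


def evaluate_with_hf(model_id: str) -> dict:
    # Illustrative stand-in: a real implementation would run the Hugging Face backend.
    return {"model": model_id, "backend": "huggingface"}


def evaluate(model_id: str) -> dict:
    """Try the vLLM backend first; fall back to Hugging Face on failure."""
    try:
        return evaluate_with_vllm(model_id)
    except Exception:
        logger.warning("Evaluation failed with vLLM - trying Hugging Face instead")
        return evaluate_with_hf(model_id)


print(evaluate("dbmdz/bert-base-german-cased"))
```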
-
This should be fixed now that the evaluations have been re-run. If you feel something is still missing, feel free to re-open! 🙂
-
Hi everyone!
Many thanks for releasing this great evaluation benchmark - it helps a lot in my research and development of language models.
I have one question regarding the reported performance scores on the German NLU datasets, mainly GermEval.
I am using the latest main version, and here are my performance comparisons with the current leaderboard, which show some discrepancies for the following models:
- dbmdz/bert-base-german-cased
- deepset/gbert-base
- gwlms/deberta-base-dewiki-v1
My assumption is that an old version of the GermEval dataset may have been used to measure performance, and that those results are now mixed with results from a more recent version of the dataset.
But this is just an assumption, as I could not reproduce the results for dbmdz/bert-base-german-cased and deepset/gbert-base - the performance difference is very high!
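One quick way to test the dataset-version hypothesis would be to inspect the test split locally and compare its size and label set against what the benchmark uses. A minimal sketch, assuming the GermEval 2014 NER data is loaded from the Hugging Face Hub (the dataset id is my assumption; the benchmark may load the data differently):

```python
from datasets import load_dataset

# Assumption: GermEval 2014 NER from the Hugging Face Hub; the benchmark may
# use a different dataset id, revision, or preprocessing.
ds = load_dataset("germeval_14", split="test")

# The number of examples and the label set should be stable across versions;
# a difference here would point at a dataset change rather than a modelling issue.
print(len(ds))
print(ds.features)
```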