What is the goal of this benchmark? #85
-
No, I definitely agree with this assumption.
I would say yes.
Is it even possible to make comparisons between languages now? I feel like this is already quite hard, e.g., due to inconsistent annotation across the datasets. One problem I could see is that if we, say, make all the Danish tasks more challenging, it might look like the models are simply worse at Danish.
-
Regarding adding new leaderboards, I'm currently thinking of having the following:
The NLU benchmarks would be like the current one, while the NLG benchmarks would only include generative models, evaluated on all the NLU and NLG tasks, with the NLU tasks framed as generative tasks. There would thus be a substantial overlap between the NLU and NLG benchmarks. This is not set in stone at all; it's merely a suggestion.

The idea is that, since the analysis in the ScandEval paper showed that it doesn't make sense to compare Mainland Scandinavian performance with German/Dutch/English/Icelandic/Faroese, it would be misleading to include them in the same leaderboard. The same analysis showed that it makes sense to merge Icelandic and Faroese, and maybe it would also show that German and Dutch should be merged; I don't know about English.

One benefit of having separate leaderboards in this way is that it gives us a bit more freedom, as we're not supposed to compare across different leaderboards. If German had a special task not present in the other languages, for instance, it could be included in that benchmark without affecting the others.
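To make the "NLU tasks framed as generative tasks" idea concrete, here is a minimal sketch of how a sentiment classification task could be turned into a generation task. The prompt template, label set and helper names are hypothetical illustrations, not the actual ScandEval implementation:

```python
# Minimal sketch of framing a classification (NLU) task as a generative task.
# The prompt template, labels and few-shot format are hypothetical, not the
# actual ScandEval setup.

LABELS = ["positiv", "negativ", "neutral"]


def build_prompt(few_shot_examples: list[tuple[str, str]], text: str) -> str:
    """Build a few-shot prompt asking the model to generate the label."""
    blocks = [f"Tekst: {t}\nSentiment: {label}" for t, label in few_shot_examples]
    blocks.append(f"Tekst: {text}\nSentiment:")
    return "\n\n".join(blocks)


def parse_label(generated: str) -> str:
    """Map the model's free-form generation back onto the fixed label set."""
    stripped = generated.strip()
    first_word = stripped.split()[0].lower() if stripped else ""
    return first_word if first_word in LABELS else "unknown"
```

The same trick would apply to the other NLU tasks (e.g., generating the list of entities for NER), which is presumably what makes the overlap between the two leaderboards possible.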
-
This is a discussion about the general aims of this benchmark. Are we aiming to keep the datasets consistent across languages, to facilitate comparison between languages? Or are we aiming to make each dataset as challenging as possible, by making the domains covered in each as diverse as possible (for instance, by having the summarisation task cover news articles, research articles, and more)?
The original aim of the benchmark was to measure general pretrained language model performance, and as such not to find, say, "the best NER model". This led to the decision to minimise the training dataset size, in order to emphasise pretraining knowledge. Has this assumption changed? And now that we are benchmarking generative models as well, is this still the goal?
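For reference, the "minimise the training dataset size" decision amounts to something like the following sketch; the dataset ID and the cap of 1,024 examples are hypothetical, not the actual ScandEval values:

```python
# Sketch of capping the training split so that performance reflects
# pretraining knowledge rather than task-specific fine-tuning.
from datasets import load_dataset

dataset = load_dataset("some-danish-ner-dataset")  # hypothetical dataset ID
train_small = dataset["train"].shuffle(seed=4242).select(range(1_024))
```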
@KennethEnevoldsen @peter-sk @peterbjorgensen