What is the goal of this benchmark? #85
-
No, I definitely agree with this assumption.
I would say yes.
Is it even possible to make comparisons between languages now? I feel like this is already quite hard, e.g., due to inconsistent annotation across the datasets. One problem I could see is that if we, say, make all the Danish tasks more challenging, it might look like the models are simply worse at Danish.
-
Regarding adding new leaderboards, I'm currently thinking of having the following:
The NLU benchmarks would be like the current one, while the NLG benchmarks would only include generative models, evaluated on all the NLU and NLG tasks, with the NLU tasks framed as generative tasks. There would thus be a substantial overlap between the NLU and NLG benchmarks. This is not set in stone at all; it's merely a suggestion.

The idea is that, since the analysis in the ScandEval paper showed that it doesn't make sense to compare Mainland Scandinavian performance with German/Dutch/English/Icelandic/Faroese, it would be misleading to include them in the same leaderboard. The same analysis showed that it makes sense to merge Icelandic and Faroese, and maybe it would also show that German and Dutch should be merged; I don't know about English.

One benefit of having separate leaderboards in this way is that it gives us a bit more freedom, as we're not supposed to compare across different leaderboards. If German had a special task not present in the other languages, for instance, it could be included in that benchmark without affecting the others.
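To make the "NLU tasks framed as generative tasks" idea concrete, here is a minimal sketch of how a sentiment classification task could be turned into a generation task. The prompt template, label set and helper names are hypothetical illustrations, not the actual ScandEval implementation:

```python
# Minimal sketch of framing a classification (NLU) task as a generative task.
# The prompt template, labels and few-shot format are hypothetical, not the
# actual ScandEval setup.

LABELS = ["positiv", "negativ", "neutral"]


def build_prompt(few_shot_examples: list[tuple[str, str]], text: str) -> str:
    """Build a few-shot prompt asking the model to generate the label."""
    blocks = [f"Tekst: {t}\nSentiment: {label}" for t, label in few_shot_examples]
    blocks.append(f"Tekst: {text}\nSentiment:")
    return "\n\n".join(blocks)


def parse_label(generated: str) -> str:
    """Map the model's free-form generation back onto the fixed label set."""
    stripped = generated.strip()
    first_word = stripped.split()[0].lower() if stripped else ""
    return first_word if first_word in LABELS else "unknown"
```

The same trick would apply to the other NLU tasks (e.g., generating the list of entities for NER), which is presumably what makes the overlap between the two leaderboards possible.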
-
This is a discussion about the general aims of this benchmark. Are we aiming to keep the datasets consistent across languages, to facilitate comparison between languages? Or are we aiming to make each dataset as challenging as possible, by making the domains covered in each as diverse as possible (for instance, by having the summarisation task cover news articles, research articles, and more)?
The original aim of the benchmark was to measure general pretrained language model performance, and as such not to find, say, "the best NER model". This led to the decision to minimise the training dataset size, in order to emphasise pretraining knowledge. Has this assumption changed? And now that we are benchmarking generative models as well, is this still the goal?
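For reference, the "minimise the training dataset size" decision amounts to something like the following sketch; the dataset ID and the cap of 1,024 examples are hypothetical, not the actual ScandEval values:

```python
# Sketch of capping the training split so that performance reflects
# pretraining knowledge rather than task-specific fine-tuning.
from datasets import load_dataset

dataset = load_dataset("some-danish-ner-dataset")  # hypothetical dataset ID
train_small = dataset["train"].shuffle(seed=4242).select(range(1_024))
```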
@KennethEnevoldsen @peter-sk @peterbjorgensen