[BUG] Impact of structured generation on scores #415
Replies: 4 comments
-
Looking a bit further into it, it seems that unparseable JSON output does not occur as frequently as I expected. Still, I think Bram's point remains that this benchmark is really testing two properties at the same time, of which only one is related to the language. I appreciate the difficulty of evaluating generative models on token-based tasks, but maybe a transposed version of the IOB format makes sense?
Logging with unparseable output: scandeval --language nl --model Rijgersberg/GEITje-7B --task named-entity-recognition --batch-size 1 --verbose
2024-04-20 11:05:59 ⋅ Benchmarking Rijgersberg/GEITje-7B on the Dutch part of the truncated version of the named entity recognition dataset CoNLL 2002
2024-04-20 11:06:04 ⋅ Loading model and tokenizer...
2024-04-20 11:06:25 ⋅ The model has 7,241,732,096 parameters, a vocabulary size of 32,000, and a maximum sequence length of 32,768.
Preprocessing data splits: 100%|█████████████████████████████████████████| 12/12 [01:05<00:00, 5.46s/it]
Benchmarking: 0%| | 0/10 [00:00<?, ?it/s]2024-04-20 11:07:39 ⋅ Test scores for iteration 0: {'micro_f1_no_misc': 0.2902155887230514, 'micro_f1': 0.24113166485310122}
Benchmarking: 10%|█████▌ | 1/10 [00:02<00:25, 2.81s/it]2024-04-20 11:07:42 ⋅ Test scores for iteration 1: {'micro_f1_no_misc': 0.42381348875936714, 'micro_f1': 0.32841747022121387}
Benchmarking: 20%|███████████ | 2/10 [00:05<00:22, 2.79s/it]2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:44 ⋅ Test scores for iteration 2: {'micro_f1_no_misc': 0.4287234042553192, 'micro_f1': 0.28315317690482467}
Benchmarking: 30%|████████████████▌ | 3/10 [00:08<00:19, 2.79s/it]2024-04-20 11:07:47 ⋅ Test scores for iteration 3: {'micro_f1_no_misc': 0.31216526396327465, 'micro_f1': 0.30631136044880786}
Benchmarking: 40%|██████████████████████ | 4/10 [00:10<00:16, 2.69s/it]2024-04-20 11:07:49 ⋅ Test scores for iteration 4: {'micro_f1_no_misc': 0.48640000000000005, 'micro_f1': 0.3169728544008774}
Benchmarking: 50%|███████████████████████████▌ | 5/10 [00:13<00:13, 2.65s/it2024-04-20 11:10:37 ⋅ Test scores for iteration 5: {'micro_f1_no_misc': 0.4564537740062528, 'micro_f1': 0.35082458770614694}
Benchmarking: 60%|█████████████████████████████████ | 6/10 [03:01<03:55, 58.80s/it2024-04-20 11:13:56 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["leerjaren", "leerjaar", "leerlingen", "leerling", "leerkracht"], "locatie": ["bos", "zee", "zeeklassen", "sneeuwklas", "sneeuwklassen"], "organisatie": ["klassen", "klas", "leerlingen", "leerlingenklas", "leerlingenklassen"], "diversen": ["geïntegreerde", "geïntegreerde werkperiode", "ge'
2024-04-20 11:13:59 ⋅ Test scores for iteration 6: {'micro_f1_no_misc': 0.35891472868217045, 'micro_f1': 0.28664822730701534}
Benchmarking: 70%|█████████████████████████████████████▊ | 7/10 [06:23<05:17, 105.69s/it2024-04-20 11:16:55 ⋅ Test scores for iteration 7: {'micro_f1_no_misc': 0.34513634513634517, 'micro_f1': 0.31033557046979865}
Benchmarking: 80%|███████████████████████████████████████████▏ | 8/10 [09:19<04:15, 127.94s/it2024-04-20 11:19:55 ⋅ Test scores for iteration 8: {'micro_f1_no_misc': 0.4107505070993914, 'micro_f1': 0.2976113621691414}
Benchmarking: 90%|████████████████████████████████████████████████▌ | 9/10 [12:19<02:24, 144.28s/it2024-04-20 11:22:49 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:51 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:51 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:52 ⋅ Test scores for iteration 9: {'micro_f1_no_misc': 0.349360388178209, 'micro_f1': 0.2681858019281332}
Benchmarking: 100%|██████████████████████████████████████████████████████| 10/10 [15:15<00:00, 91.58s/it]
2024-04-20 11:22:52 ⋅ Finished evaluation of Rijgersberg/GEITje-7B on the Dutch part of the truncated version of the named entity recognition dataset CoNLL 2002.
2024-04-20 11:22:52 ⋅ Micro-average F1-score without MISC tags: 38.62% ± 3.99%
2024-04-20 11:22:52 ⋅ Micro-average F1-score with MISC tags: 29.90% ± 1.93%
2024-04-20 11:22:52 ⋅ Results:
{'raw': defaultdict(<class 'list'>, {'test': [{'micro_f1_no_misc': 0.2902155887230514, 'micro_f1': 0.24113166485310122}, {'micro_f1_no_misc': 0.42381348875936714, 'micro_f1': 0.32841747022121387}, {'micro_f1_no_misc': 0.4287234042553192, 'micro_f1': 0.28315317690482467}, {'micro_f1_no_misc': 0.31216526396327465, 'micro_f1': 0.30631136044880786}, {'micro_f1_no_misc': 0.48640000000000005, 'micro_f1': 0.3169728544008774}, {'micro_f1_no_misc': 0.4564537740062528, 'micro_f1': 0.35082458770614694}, {'micro_f1_no_misc': 0.35891472868217045, 'micro_f1': 0.28664822730701534}, {'micro_f1_no_misc': 0.34513634513634517, 'micro_f1': 0.31033557046979865}, {'micro_f1_no_misc': 0.4107505070993914, 'micro_f1': 0.2976113621691414}, {'micro_f1_no_misc': 0.349360388178209, 'micro_f1': 0.2681858019281332}]}), 'total': {'test_micro_f1_no_misc': 38.61933488803381, 'test_micro_f1_no_misc_se': 3.9892615879301965, 'test_micro_f1': 29.89592076409061, 'test_micro_f1_se': 1.933691061140005}}
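The skipped outputs in the log above presumably count as empty predictions, which deflates span-level micro-F1 even when the model's entity knowledge is fine. A minimal sketch of this effect, using a hypothetical `micro_f1` helper (not ScandEval's actual scoring code), where `None` stands for an unparseable output:

```python
def micro_f1(gold_docs, pred_docs):
    """Span-level micro-F1 over documents.

    gold_docs / pred_docs: lists of sets of (entity_type, surface) spans;
    a None prediction stands for an unparseable model output, which is
    scored as "no entities predicted".
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        pred = pred or set()          # unparseable -> empty prediction
        tp += len(gold & pred)        # correctly predicted spans
        fp += len(pred - gold)        # spurious spans
        fn += len(gold - pred)        # missed spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With two documents and one unparseable output, recall halves and micro-F1 drops from 1.0 to about 0.67, regardless of how good the model's entity predictions were on the failed document.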
-
Thanks for your comments, both of you! We actually already use outlines during generation for the NER task, exactly for the reasons you mention. The only way the models can output invalid JSON is when they run out of tokens, which is what happened in the examples you posted.
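The truncated outputs in the log are well-formed JSON that is merely missing its closing delimiters. A hedged sketch (a hypothetical `parse_possibly_truncated` helper, not what ScandEval actually does) of recovering such outputs by appending the missing brackets before giving up:

```python
import json


def parse_possibly_truncated(output: str):
    """Try to parse model output as JSON; on failure, attempt a minimal
    repair by closing an unterminated string and appending the missing
    brackets/braces. This is enough when the model simply ran out of
    tokens mid-object. Note: the bracket counting is naive (it also
    counts brackets inside string values), so this is only a sketch.
    """
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        pass
    repaired = output
    if repaired.count('"') % 2 == 1:   # unterminated string
        repaired += '"'
    repaired += "]" * (repaired.count("[") - repaired.count("]"))
    repaired += "}" * (repaired.count("{") - repaired.count("}"))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None
```

For instance, the log's output ending in `"diversen": [...]'` parses fine once a single `}` is appended; whether such repaired outputs *should* be scored is of course a separate question.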
-
I did consider other formats as well, but the main downside of many of them is that the model has to output the entire input document, which is not very feasible (and definitely not how these models would be used in practice).
-
In that case I fear there is little to do, other than emphasizing ourselves that benchmarks involving structured generation have their own issues, and that models that are less familiar with generating code fragments or structured data may score more poorly. Such scores may not reflect the content of the task but only its representation. Thanks for brainstorming though!
-
🐛 Describe the bug
Not really a bug but more of a "worry".
I noticed that CoNLL scores for Dutch heavily favored models pretrained on English, while models that were later fine-tuned for Dutch scored much worse (which is unexpected for a Dutch benchmark). Edwin Rijgersberg went to the trouble of looking at what is actually happening, and found that evaluation for CoNLL is based on (an attempt by the model at) structured generation:
The model is expected to output the JSON structure. By default, models are notoriously bad at this type of thing. It is no surprise that pretrained models are "better at this", since the original training data likely contained some code, JSON, or other structured data in the mix. When we then fine-tune that model purely on text, it may "forget" those learnt structures. So it is not too surprising that Dutch models are worse at this task: in reality they may be worse at generating structured data rather than worse at NER.
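To score such an answer as NER, the dict-of-lists JSON presumably has to be mapped back to token-level tags. A hedged sketch of one way to do that (a hypothetical `json_to_iob` helper, not ScandEval's actual implementation), which greedily matches each mention's tokens against the input:

```python
def json_to_iob(tokens, answer, label_map):
    """Map a dict-of-lists JSON answer to token-level IOB tags.

    tokens: the tokenized input sentence.
    answer: dict mapping a (Dutch) entity name to a list of entity
        surface strings, e.g. {"persoon": ["Jan"], "locatie": ["Gent"]}.
    label_map: entity name -> tag abbreviation, e.g. {"persoon": "PER"}.
    Returns one IOB tag per input token.
    """
    tags = ["O"] * len(tokens)
    for name, mentions in answer.items():
        tag = label_map.get(name)
        if tag is None:               # ignore unknown entity names
            continue
        for mention in mentions:
            span = mention.split()
            # Greedily tag the first unclaimed occurrence of the span.
            for i in range(len(tokens) - len(span) + 1):
                if tokens[i:i + len(span)] == span and all(
                    t == "O" for t in tags[i:i + len(span)]
                ):
                    tags[i] = f"B-{tag}"
                    for j in range(i + 1, i + len(span)):
                        tags[j] = f"I-{tag}"
                    break
    return tags
```

This also shows where the two properties get entangled: any mention the model phrases slightly differently from the input (or any malformed JSON upstream) silently becomes `O` tags, i.e. an NER miss.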
tl;dr The CoNLL task has a significant impact on the total score and rank of models. I fear that it currently does not measure what we wish to measure (structured generation instead of NER tagging). Could we integrate with outlines or similar to avoid JSON issues?

Operating System
Linux
Device
CUDA GPU
Python version
3.10.x
ScandEval version
main