[BUG] Impact of structured generation on scores #415
Replies: 4 comments
-
Looking a bit further into it, it seems that unparseable JSON output does not occur as frequently as I expected. Still, I think Bram's point remains that this benchmark is really testing two properties at the same time, of which only one is related to the language. I appreciate the difficulty of evaluating generative models on token-based tasks, but maybe a transposed version of the IOB format makes sense?
Logging with unparseable output: scandeval --language nl --model Rijgersberg/GEITje-7B --task named-entity-recognition --batch-size 1 --verbose
2024-04-20 11:05:59 ⋅ Benchmarking Rijgersberg/GEITje-7B on the Dutch part of the truncated version of the named entity recognition dataset CoNLL 2002
2024-04-20 11:06:04 ⋅ Loading model and tokenizer...
2024-04-20 11:06:25 ⋅ The model has 7,241,732,096 parameters, a vocabulary size of 32,000, and a maximum sequence length of 32,768.
Preprocessing data splits: 100%|█████████████████████████████████████████| 12/12 [01:05<00:00, 5.46s/it]
Benchmarking: 0%| | 0/10 [00:00<?, ?it/s]2024-04-20 11:07:39 ⋅ Test scores for iteration 0: {'micro_f1_no_misc': 0.2902155887230514, 'micro_f1': 0.24113166485310122}
Benchmarking: 10%|█████▌ | 1/10 [00:02<00:25, 2.81s/it]2024-04-20 11:07:42 ⋅ Test scores for iteration 1: {'micro_f1_no_misc': 0.42381348875936714, 'micro_f1': 0.32841747022121387}
Benchmarking: 20%|███████████ | 2/10 [00:05<00:22, 2.79s/it]2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:43 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["jullie", "jullie", "jullie", "jullie", "jullie"], "locatie": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"], "organisatie": ["jullie", "jullie", "jullie", "jullie", "jullie"], "diversen": ["nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start", "nieuwe start"]'
2024-04-20 11:07:44 ⋅ Test scores for iteration 2: {'micro_f1_no_misc': 0.4287234042553192, 'micro_f1': 0.28315317690482467}
Benchmarking: 30%|████████████████▌ | 3/10 [00:08<00:19, 2.79s/it]2024-04-20 11:07:47 ⋅ Test scores for iteration 3: {'micro_f1_no_misc': 0.31216526396327465, 'micro_f1': 0.30631136044880786}
Benchmarking: 40%|██████████████████████ | 4/10 [00:10<00:16, 2.69s/it]2024-04-20 11:07:49 ⋅ Test scores for iteration 4: {'micro_f1_no_misc': 0.48640000000000005, 'micro_f1': 0.3169728544008774}
Benchmarking: 50%|███████████████████████████▌ | 5/10 [00:13<00:13, 2.65s/it2024-04-20 11:10:37 ⋅ Test scores for iteration 5: {'micro_f1_no_misc': 0.4564537740062528, 'micro_f1': 0.35082458770614694}
Benchmarking: 60%|█████████████████████████████████ | 6/10 [03:01<03:55, 58.80s/it2024-04-20 11:13:56 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["leerjaren", "leerjaar", "leerlingen", "leerling", "leerkracht"], "locatie": ["bos", "zee", "zeeklassen", "sneeuwklas", "sneeuwklassen"], "organisatie": ["klassen", "klas", "leerlingen", "leerlingenklas", "leerlingenklassen"], "diversen": ["geïntegreerde", "geïntegreerde werkperiode", "ge'
2024-04-20 11:13:59 ⋅ Test scores for iteration 6: {'micro_f1_no_misc': 0.35891472868217045, 'micro_f1': 0.28664822730701534}
Benchmarking: 70%|█████████████████████████████████████▊ | 7/10 [06:23<05:17, 105.69s/it2024-04-20 11:16:55 ⋅ Test scores for iteration 7: {'micro_f1_no_misc': 0.34513634513634517, 'micro_f1': 0.31033557046979865}
Benchmarking: 80%|███████████████████████████████████████████▏ | 8/10 [09:19<04:15, 127.94s/it2024-04-20 11:19:55 ⋅ Test scores for iteration 8: {'micro_f1_no_misc': 0.4107505070993914, 'micro_f1': 0.2976113621691414}
Benchmarking: 90%|████████████████████████████████████████████████▌ | 9/10 [12:19<02:24, 144.28s/it2024-04-20 11:22:49 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:51 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:51 ⋅ The model output is not valid JSON, so cannot parse it. Skipping. Here is the output: '{"persoon": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "locatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emballages Ensemble , Orishas , Manu Dibango en Dirty Beatniks"], "organisatie": ["Humb", "Starflam", "Mad Professor", "Think of One", "Marrakech Emball'
2024-04-20 11:22:52 ⋅ Test scores for iteration 9: {'micro_f1_no_misc': 0.349360388178209, 'micro_f1': 0.2681858019281332}
Benchmarking: 100%|██████████████████████████████████████████████████████| 10/10 [15:15<00:00, 91.58s/it]
2024-04-20 11:22:52 ⋅ Finished evaluation of Rijgersberg/GEITje-7B on the Dutch part of the truncated version of the named entity recognition dataset CoNLL 2002.
2024-04-20 11:22:52 ⋅ Micro-average F1-score without MISC tags: 38.62% ± 3.99%
2024-04-20 11:22:52 ⋅ Micro-average F1-score with MISC tags: 29.90% ± 1.93%
2024-04-20 11:22:52 ⋅ Results:
{'raw': defaultdict(<class 'list'>, {'test': [{'micro_f1_no_misc': 0.2902155887230514, 'micro_f1': 0.24113166485310122}, {'micro_f1_no_misc': 0.42381348875936714, 'micro_f1': 0.32841747022121387}, {'micro_f1_no_misc': 0.4287234042553192, 'micro_f1': 0.28315317690482467}, {'micro_f1_no_misc': 0.31216526396327465, 'micro_f1': 0.30631136044880786}, {'micro_f1_no_misc': 0.48640000000000005, 'micro_f1': 0.3169728544008774}, {'micro_f1_no_misc': 0.4564537740062528, 'micro_f1': 0.35082458770614694}, {'micro_f1_no_misc': 0.35891472868217045, 'micro_f1': 0.28664822730701534}, {'micro_f1_no_misc': 0.34513634513634517, 'micro_f1': 0.31033557046979865}, {'micro_f1_no_misc': 0.4107505070993914, 'micro_f1': 0.2976113621691414}, {'micro_f1_no_misc': 0.349360388178209, 'micro_f1': 0.2681858019281332}]}), 'total': {'test_micro_f1_no_misc': 38.61933488803381, 'test_micro_f1_no_misc_se': 3.9892615879301965, 'test_micro_f1': 29.89592076409061, 'test_micro_f1_se': 1.933691061140005}}
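The skipped outputs in the log above presumably count as empty predictions, which deflates span-level micro-F1 even when the model's entity knowledge is fine. A minimal sketch of this effect, using a hypothetical `micro_f1` helper (not ScandEval's actual scoring code), where `None` stands for an unparseable output:

```python
def micro_f1(gold_docs, pred_docs):
    """Span-level micro-F1 over documents.

    gold_docs / pred_docs: lists of sets of (entity_type, surface) spans;
    a None prediction stands for an unparseable model output, which is
    scored as "no entities predicted".
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        pred = pred or set()          # unparseable -> empty prediction
        tp += len(gold & pred)        # correctly predicted spans
        fp += len(pred - gold)        # spurious spans
        fn += len(gold - pred)        # missed spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With two documents and one unparseable output, recall halves and micro-F1 drops from 1.0 to about 0.67, regardless of how good the model's entity predictions were on the failed document.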
-
Thanks for your comments, both of you! We actually already use outlines during generation for the NER task, exactly for the reasons you mention. The only way the models can output invalid JSON is when they run out of tokens, which is what happened in the examples you posted.
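The truncated outputs in the log are well-formed JSON that is merely missing its closing delimiters. A hedged sketch (a hypothetical `parse_possibly_truncated` helper, not what ScandEval actually does) of recovering such outputs by appending the missing brackets before giving up:

```python
import json


def parse_possibly_truncated(output: str):
    """Try to parse model output as JSON; on failure, attempt a minimal
    repair by closing an unterminated string and appending the missing
    brackets/braces. This is enough when the model simply ran out of
    tokens mid-object. Note: the bracket counting is naive (it also
    counts brackets inside string values), so this is only a sketch.
    """
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        pass
    repaired = output
    if repaired.count('"') % 2 == 1:   # unterminated string
        repaired += '"'
    repaired += "]" * (repaired.count("[") - repaired.count("]"))
    repaired += "}" * (repaired.count("{") - repaired.count("}"))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None
```

For instance, the log's output ending in `"diversen": [...]'` parses fine once a single `}` is appended; whether such repaired outputs *should* be scored is of course a separate question.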
-
I did consider other formats as well, but the main downside of many of them is that the model has to output the entire input document, which is not very feasible (and definitely not how these models would be used in practice).
-
In that case I fear there is little to do, other than emphasizing ourselves that benchmarks involving structured generation have their own issues, and that models that are less familiar with generating code fragments or structured data may score more poorly. Such scores may not reflect the content of the task but only its representation. Thanks for brainstorming though!
-
🐛 Describe the bug
Not really a bug but more of a "worry".
I noticed that CoNLL scores for Dutch heavily favored models pretrained on English, while models that were later fine-tuned for Dutch scored much worse (which is unexpected for a Dutch benchmark). Edwin Rijgersberg went to the trouble of looking at what is actually happening, and found that evaluation for CoNLL is based on (an attempt by the model at) structured generation:
The model is expected to output the JSON structure. By default, models are notoriously bad at this type of thing. It is no surprise that pretrained models are "better at this", since the original training data likely contained some code, JSON, or other structured data in the mix. When we then fine-tune that model purely on text, it may "forget" those learnt structures. So it is not too surprising that Dutch models are worse at this task: in reality they may be worse at generating structured data rather than worse at NER.
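To score such an answer as NER, the dict-of-lists JSON presumably has to be mapped back to token-level tags. A hedged sketch of one way to do that (a hypothetical `json_to_iob` helper, not ScandEval's actual implementation), which greedily matches each mention's tokens against the input:

```python
def json_to_iob(tokens, answer, label_map):
    """Map a dict-of-lists JSON answer to token-level IOB tags.

    tokens: the tokenized input sentence.
    answer: dict mapping a (Dutch) entity name to a list of entity
        surface strings, e.g. {"persoon": ["Jan"], "locatie": ["Gent"]}.
    label_map: entity name -> tag abbreviation, e.g. {"persoon": "PER"}.
    Returns one IOB tag per input token.
    """
    tags = ["O"] * len(tokens)
    for name, mentions in answer.items():
        tag = label_map.get(name)
        if tag is None:               # ignore unknown entity names
            continue
        for mention in mentions:
            span = mention.split()
            # Greedily tag the first unclaimed occurrence of the span.
            for i in range(len(tokens) - len(span) + 1):
                if tokens[i:i + len(span)] == span and all(
                    t == "O" for t in tags[i:i + len(span)]
                ):
                    tags[i] = f"B-{tag}"
                    for j in range(i + 1, i + len(span)):
                        tags[j] = f"I-{tag}"
                    break
    return tags
```

This also shows where the two properties get entangled: any mention the model phrases slightly differently from the input (or any malformed JSON upstream) silently becomes `O` tags, i.e. an NER miss.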
tl;dr The CoNLL task has a significant impact on the total score and rank of models. I fear that it currently does not measure what we wish to measure (structured generation instead of NER tagging). Could we integrate with outlines or similar to avoid JSON issues?

Operating System
Linux
Device
CUDA GPU
Python version
3.10.x
ScandEval version
main