As the title says, generation in various benchmarks does not stop when the model returns a single newline (\n).
The model I am currently evaluating consistently outputs only a single newline, yet its answers all seem to be correct. This is an issue for at least two benchmarks I looked at: SQUAD-NL and Dutch Social.
Output for SQUAD-NL:

```json
{"completion": " 50%.\nTekst: De eerste grote studie naar de effecten van de Zwarte Dood op de Europese economie werd in de jaren 1960-1970 uitgevoerd door", "top_score_indices": null, "top_score_values": null, "vocab_size": null}
```
In both cases the model, which is a base model and not instruction-tuned, just keeps rambling on. This is a disadvantage of this type of model, but it would be nice to be able to configure the benchmark suite to accept a single newline as a stopping point. That way I could at least get a sense of how good the model is.
Update: I changed the code to strip all tokens after newlines and any other stop words. This affects my model's scores significantly: SQUAD-NL, for instance, goes from 0.00 / 14.18 to 46.58 / 62.01, which takes it from frankly quite bad to a pretty decent model.
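For reference, the post-processing I applied boils down to something like the following sketch (the helper name and default stop-word list are mine, not ScandEval's):

```python
def truncate_at_stop_words(completion: str, stop_words: tuple[str, ...] = ("\n",)) -> str:
    """Cut a generated completion at the first occurrence of any stop word."""
    cut = len(completion)
    for stop_word in stop_words:
        idx = completion.find(stop_word)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()


# The SQUAD-NL completion above becomes just the answer:
raw = " 50%.\nTekst: De eerste grote studie naar de effecten van ..."
print(truncate_at_stop_words(raw))  # -> "50%."
```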
There is also a second issue: generation does not stop on a stop word unless all sequences in the batch have reached that stop word. I don't think this is intended, as for chat models the generation gets post-processed afterwards to remove those extra tokens anyway. I still have to investigate it a bit, but I will open a separate issue for that as well.
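To illustrate what I mean: a batch-level stopping criterion in the style of Hugging Face transformers (a simplified sketch, not ScandEval's actual code) only halts generation once every row of the batch ends in the stop sequence:

```python
import torch
from transformers import StoppingCriteria


class StopOnSubsequence(StoppingCriteria):
    """Stop generation only when *all* sequences in the batch end with the
    given token ids. This mirrors the behaviour described above: a sequence
    that has already produced the stop word keeps generating until the rest
    of the batch catches up."""

    def __init__(self, stop_ids: list[int]):
        self.stop_ids = torch.tensor(stop_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        n = len(self.stop_ids)
        if input_ids.shape[1] < n:
            return False
        # One boolean per sequence; generation halts only if all are True.
        done = (input_ids[:, -n:].cpu() == self.stop_ids).all(dim=1)
        return bool(done.all())
```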
You're right that we currently use double newlines as the stopping tokens, as the prompts always have the following structure:
```
[short intro starting with "The following are"]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label: [label/answer]

[... more few-shot examples ...]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label:
```
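In sketch form, the prompts are assembled along these lines (simplified; not the exact ScandEval code):

```python
def build_prompt(intro: str, examples: list[tuple[str, str]], new_text: str) -> str:
    """Assemble a few-shot prompt from labelled examples."""
    blocks = [intro]
    blocks += [f"Text: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Text: {new_text}\nLabel:")
    # Examples are separated by double newlines, which is what makes a
    # double newline a natural stopping point for the completion.
    return "\n\n".join(blocks)
```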
A double newline is thus a natural stopping point, also for base models, so it is a bit curious that your model doesn't output these.
That being said, we could add single newlines as stopping tokens as well, as I can't (off the top of my head) see anything that would break with that addition - I'd have to check, though.
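With Hugging Face transformers, that would amount to something along these lines (a sketch; recent transformers versions accept `stop_strings` in `generate`, but how this slots into ScandEval's generation code is untested here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Text: some sample\nLabel:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    # A single newline as stop string also covers the double-newline case.
    stop_strings=["\n"],
    tokenizer=tokenizer,  # required for string-based stopping
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```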
We can't make it configurable, however, as that would mean multiple different evaluation scores for each model, which goes against the spirit of the benchmark.
Operating System: Linux
Device: CUDA GPU
Python version: 3.11.x
ScandEval version: 12.9.1