As the title says, generation in various benchmarks does not stop when the model returns a single newline (\n).
The model I am currently evaluating consistently outputs only a single newline, yet its answers all seem to be correct. This is an issue for at least two benchmarks I looked at: SQUAD-NL and Dutch Social.
Output for SQUAD-NL:

```json
{"completion": " 50%.\nTekst: De eerste grote studie naar de effecten van de Zwarte Dood op de Europese economie werd in de jaren 1960-1970 uitgevoerd door", "top_score_indices": null, "top_score_values": null, "vocab_size": null}
```
In both cases the model, which is a base model and not instruction-tuned, just keeps rambling on. This is a disadvantage of this type of model, but it would be nice to be able to configure the benchmark suite to accept a single newline as a stopping point. That way I could at least get a sense of how good the model is.
Update: I changed the code to strip all tokens after newlines and any other stop words. This affects my model's scores significantly: SQUAD-NL, for instance, goes from 0.00 / 14.18 to 46.58 / 62.01, which takes it from frankly quite bad to a pretty decent model.
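For reference, the post-processing I applied boils down to something like the following sketch (the helper name and default stop-word list are mine, not ScandEval's):

```python
def truncate_at_stop_words(completion: str, stop_words: tuple[str, ...] = ("\n",)) -> str:
    """Cut a generated completion at the first occurrence of any stop word."""
    cut = len(completion)
    for stop_word in stop_words:
        idx = completion.find(stop_word)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()


# The SQUAD-NL completion above becomes just the answer:
raw = " 50%.\nTekst: De eerste grote studie naar de effecten van ..."
print(truncate_at_stop_words(raw))  # -> "50%."
```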
There is also a second issue: generation does not stop on a stop word unless all sequences in the batch have reached that stop word. I don't think this is intended, as for chat models the generation gets post-processed afterwards to remove those extra tokens anyway. I still have to investigate it a bit, but I will open a separate issue for that as well.
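To illustrate what I mean: a batch-level stopping criterion in the style of Hugging Face transformers (a simplified sketch, not ScandEval's actual code) only halts generation once every row of the batch ends in the stop sequence:

```python
import torch
from transformers import StoppingCriteria


class StopOnSubsequence(StoppingCriteria):
    """Stop generation only when *all* sequences in the batch end with the
    given token ids. This mirrors the behaviour described above: a sequence
    that has already produced the stop word keeps generating until the rest
    of the batch catches up."""

    def __init__(self, stop_ids: list[int]):
        self.stop_ids = torch.tensor(stop_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        n = len(self.stop_ids)
        if input_ids.shape[1] < n:
            return False
        # One boolean per sequence; generation halts only if all are True.
        done = (input_ids[:, -n:].cpu() == self.stop_ids).all(dim=1)
        return bool(done.all())
```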
You're right that we currently use double newlines as the stopping tokens, as the prompts always have the following structure:
```
[short intro starting with "The following are"]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label: [label/answer]

[... more few-shot examples ...]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label:
```
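In sketch form, the prompts are assembled along these lines (simplified; not the exact ScandEval code):

```python
def build_prompt(intro: str, examples: list[tuple[str, str]], new_text: str) -> str:
    """Assemble a few-shot prompt from labelled examples."""
    blocks = [intro]
    blocks += [f"Text: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Text: {new_text}\nLabel:")
    # Examples are separated by double newlines, which is what makes a
    # double newline a natural stopping point for the completion.
    return "\n\n".join(blocks)
```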
A double newline is thus a natural stopping point, also for base models, so it is a bit curious that your model doesn't output these.
That being said, we could add single newlines as stopping tokens as well, as I can't (off the top of my head) see anything that would break with that addition - I'd have to check, though.
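With Hugging Face transformers, that would amount to something along these lines (a sketch; recent transformers versions accept `stop_strings` in `generate`, but how this slots into ScandEval's generation code is untested here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Text: some sample\nLabel:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    # A single newline as stop string also covers the double-newline case.
    stop_strings=["\n"],
    tokenizer=tokenizer,  # required for string-based stopping
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```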
We can't make it configurable, however, as that would mean multiple different evaluation scores for each model, which goes against the spirit of the benchmark.
Operating System: Linux
Device: CUDA GPU
Python version: 3.11.x
ScandEval version: 12.9.1