
[BUG] Generation does not terminate on single newline #432

Open
iPieter opened this issue May 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

iPieter commented May 6, 2024

🐛 Describe the bug

As the title says, the generation in various benchmarks does not stop when the model returns a single newline (\n).

The model I am currently evaluating consistently outputs only a single newline, but the answers all seem to be correct. This is an issue for at least two benchmarks I looked at: SQUAD-NL and Dutch Social.

Output for SQUAD-NL

{"completion": " 50%.\nTekst: De eerste grote studie naar de effecten van de Zwarte Dood op de Europese economie werd in de jaren 1960-1970 uitgevoerd door", "top_score_indices": null, "top_score_values": null, "vocab_size": null},

Output for Dutch Social

{"completion": " positief\nT", "top_score_indices": [[11169, 15752, 14481, 20764, 11000, 12240, 28630, 2935, 32897, 3361], [203, 2, 202, 18, 1129, 458, 19, 18935, 225, 16], [56, 55, 458, 54, 25630, 42, 44, 76, 3894, 88]]

In both cases the model, which is a base model and not instruction-tuned, just keeps rambling on. This is a known weakness of this type of model, but it would be nice if the benchmark suite could be configured to accept a single newline as a stopping point. That way I can at least get a sense of how good the model is.

Double newlines are supported, as are two consecutive single newlines. An option to make this configurable would be really nice.

Operating System

Linux

Device

CUDA GPU

Python version

3.11.x

ScandEval version

ScandEval==12.9.1

iPieter added the bug label on May 6, 2024

iPieter commented May 10, 2024

Update: I changed the code to remove any tokens after newlines and all other stopwords. This affects the measured performance of my model significantly: for instance, SQUAD-NL goes from 0.00 / 14.18 to 46.58 / 62.01, which takes it from frankly quite bad to pretty decent.
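
For reference, a minimal sketch of the kind of post-processing I applied (the function name and stopword list are illustrative, not ScandEval's actual code):

```python
# Truncate each completion at the earliest stop sequence that occurs in it.
# The stopword list here is an assumption for illustration only.
STOP_SEQUENCES = ["\n\n", "\n", "Text:", "Label:"]

def truncate_at_stopwords(completion: str) -> str:
    """Drop everything from the earliest stop sequence onwards."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()

print(truncate_at_stopwords(" positief\nT"))  # -> "positief"
```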

There is also a second issue, namely that generation does not stop on a stopword unless all sequences in the batch have reached that stopword. I don't think this is intended, as for chat models the generation gets post-processed to remove trailing tokens anyway. I still have to investigate it a bit, but I will open a separate issue for it.
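
To illustrate the batch behaviour (a sketch assuming a Hugging Face-style generation loop, not ScandEval's actual code): a classic StoppingCriteria returns a single bool for the whole batch, so generation only halts once every sequence has produced the stop token.

```python
import torch
from transformers import StoppingCriteria

class BatchStopOnToken(StoppingCriteria):
    """Stops generation only when *all* sequences contain the stop token,
    which is the batch-wide behaviour described above."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # One bool per sequence: has the stop token appeared anywhere yet?
        stopped = (input_ids == self.stop_token_id).any(dim=1)
        # Collapsing this to a single bool is what makes the whole batch
        # wait for its slowest sequence.
        return bool(stopped.all())
```

(I believe recent transformers versions also allow returning a per-sequence BoolTensor, which would stop each sequence individually.)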

saattrupdan (Member) commented

Hi @iPieter, and thanks for your bug report!

You're right that we currently use double newlines as the stopping tokens, as the prompts always have the following structure:

[short intro starting with "The following are"]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label: [label/answer]

[... more few-shot examples ...]

Text: [sample text]
Label: [label/answer]

Text: [sample text]
Label: 

A double newline is thus a natural stopping point, even for base models, so it's a bit curious that your model doesn't output these.

That being said, we could assign single newlines as stopping tokens as well, as I can't (off the top of my head) see anything that would break with that addition - I'd have to check though.
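
For concreteness, something along these lines (a sketch using the stop_strings argument available in recent transformers versions; whether we'd wire it up exactly like this in ScandEval is an open question, and the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Text: Geweldige film!\nLabel:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    stop_strings=["\n\n", "\n"],  # double *and* single newline as stops
    tokenizer=tokenizer,          # generate() needs this when stop_strings is set
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```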

We can't make it configurable, however, as that would mean multiple different evaluation scores for each model, which goes against the spirit of the benchmark.
