Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

word_language_model/data.py - remove '<eos>' #1228

Open
drtonyr opened this issue Feb 5, 2024 · 0 comments
Open

word_language_model/data.py - remove '<eos>' #1228

drtonyr opened this issue Feb 5, 2024 · 0 comments

Comments

@drtonyr
Copy link

drtonyr commented Feb 5, 2024

A long long time ago, back in the days when n-gram modelling ruled, our LMs didn't carry context over from one sentence to another. In the current era of LLMs we use many past sentences as context.

The supplied data in data/wikitext-2 acknowledges this change and uses '.' as the token to represent the full stop at the end of words. However, the code in data.py is still n-gram style and explicitly enforces the token <eos> at the end of every line.

This is (extremely valuable!) example code, it should be as clean and general as possible. The use of <eos> adds nothing and should be removed. Indeed, it can be seen as a bug as it's adding an artificial token when it's easily predictable so it will artificially reduce perplexity. Example code should be general and not have hard wired variables in it from a legacy implementation. If people want to have a <eos> token then it should be in the data like wikitext-2, if they don't want to use it then it shouldn't be enforced.

It would be very easy to remove the cases of + ['<eos>'] in data.py and the resulting example code would be more general and more scientific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant