
No assert: Training does not start when using different tokenizer/ tokenized-data #148

Open
adriwitek opened this issue Dec 13, 2023 · 0 comments

Since there are many potential problems, especially in a multinode setting, I want to open this issue because I have spent a lot of time trying to debug it.

If your data was tokenized with one tokenizer, call it X, but the model is configured with a different tokenizer Y, then training will freeze and never start!

This may seem obvious, but no warning or error is displayed; execution continues and appears to be running normally, even though tokenizer X can have a different vocab size than tokenizer Y! An assert should be added, since this silently wastes a lot of resources, and pre-training is not cheap at all.
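As a minimal sketch of the kind of assert I mean (the function and parameter names here are illustrative, not from the project's actual API): before training starts, compare the configured tokenizer's vocab size against the token ids actually present in the pre-tokenized data, and fail fast on a mismatch.

```python
import numpy as np

def assert_tokenizer_matches_data(tokenizer_vocab_size: int, token_ids: np.ndarray) -> None:
    """Fail fast if the data was tokenized with a larger-vocab tokenizer.

    Hypothetical helper: `tokenizer_vocab_size` and `token_ids` are
    illustrative names, not part of this project's real API.
    """
    max_id = int(token_ids.max())
    assert max_id < tokenizer_vocab_size, (
        f"Token id {max_id} in the dataset is out of range for the current "
        f"tokenizer (vocab size {tokenizer_vocab_size}). The data was likely "
        "tokenized with a different tokenizer."
    )

# Example: data tokenized with a ~50k-vocab tokenizer, loaded with a 32k one.
data = np.array([[101, 49999, 2045], [7, 8, 9]], dtype=np.int64)
try:
    assert_tokenizer_matches_data(32000, data)
except AssertionError as e:
    print("Mismatch caught before training:", e)
```

A check like this is cheap relative to a pre-training run and would turn a silent multinode hang into an immediate, explicit error.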
