Properly handle the case when text size is more than the model can handle #164

Open

lalitpagaria opened this issue Jul 9, 2021 · 1 comment

Labels: enhancement (New feature or request)

Comments

@lalitpagaria (Collaborator)

When this project was started, the initial intention was to handle only short text. But now that we have added Google News and crawler sources, there is a need to handle longer text as well.
As we know, most BERT-based models support at most 512 tokens (with a few exceptions such as BigBird). Currently the Analyzer ignores (#113) the excess text.

@akar5h suggested a nicer way to split based on tokenizer output instead of character size.

So would it not be more optimal that, instead of splitting the texts, we split the tokenizer output (the tokenizer.batch_encode_plus output), i.e. input_ids and attention_mask, into sequences of length 512 and feed these splits to the model? A rough sketch of that idea is below.
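
A minimal sketch of this approach, assuming a BERT-style sequence classifier with a 512-token limit; the model name and the mean-of-logits aggregation are illustrative choices, not what Obsei currently does:

```python
# Sketch: encode once without truncation, slice the token ids into 512-length
# chunks, run each chunk through the model, and aggregate the per-chunk logits.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def classify_long_text(text: str, max_len: int = 512) -> torch.Tensor:
    # Tokenize the full text without truncation.
    encoded = tokenizer(text, add_special_tokens=True, truncation=False)
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]

    # Split the tokenizer output (not the raw text) into max_len-sized chunks.
    chunks = [
        (input_ids[i : i + max_len], attention_mask[i : i + max_len])
        for i in range(0, len(input_ids), max_len)
    ]

    logits = []
    for ids, mask in chunks:
        with torch.no_grad():
            out = model(
                input_ids=torch.tensor([ids]),
                attention_mask=torch.tensor([mask]),
            )
        logits.append(out.logits)

    # Simple mean over chunks; other aggregation strategies are possible.
    return torch.cat(logits, dim=0).mean(dim=0)
```

One caveat with naive slicing: chunks after the first do not start with [CLS] or end with [SEP], so in practice the special tokens would need to be re-added per chunk.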

lalitpagaria added the enhancement label on Jul 9, 2021
akar5h added a commit to akar5h/obsei that referenced this issue Jul 10, 2021
lalitpagaria pushed a commit to akar5h/obsei that referenced this issue Jul 26, 2021
@lalitpagaria (Collaborator, Author)

We can see if this can be incorporated here, as suggested by @akar5h:

https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=n9qywopnIrJH&uniqifier=1

Scroll down to "Preprocessing the training data". There we can see how the Hugging Face tokenizer provides a way to split longer documents as a built-in operation; a small sketch of that feature is below.
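
A minimal sketch of that built-in splitting, using `return_overflowing_tokens` with a `stride` so consecutive chunks overlap slightly (as the linked notebook does); the model name, `max_length`, and `stride` values here are illustrative:

```python
# Sketch: let the (fast) tokenizer split an over-long document into
# overlapping 512-token chunks instead of silently truncating it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "some long document text " * 200  # stand-in for a crawled article

encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # emit extra chunks instead of dropping text
    padding="max_length",
)

# One entry per chunk; each chunk can be fed to the model independently.
print(len(encoded["input_ids"]), "chunks of", len(encoded["input_ids"][0]), "tokens")
```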
