When this project was started, the initial intention was to handle only short text. Now that we have added Google News and crawlers, there is a need to handle longer text as well.

As we know, most BERT-based models support a maximum of 512 tokens (with a few exceptions such as BigBird). Currently the Analyzer ignores excessive text (#113).

@akar5h suggested a nice way to split based on the tokenizer output instead of character size:

> So would it not be more optimal, instead of splitting the text itself, to split the tokenizer output (from `tokenizer.batch_encode_plus`), i.e. the `input_ids` and `attention_mask`, into sequences of length 512, and feed these splits to the model?
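A minimal sketch of that idea, assuming a Hugging Face tokenizer and model; the model name, the logit-averaging step, and the chunking helper below are illustrative placeholders, not the Analyzer's actual implementation:

```python
# Sketch: encode a long document once, then split the token-level output
# (input_ids / attention_mask) into windows the model can accept.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder; any 512-token BERT variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

long_text = "..."  # a document longer than 512 tokens

# Encode without special tokens so the sequence can be chunked freely.
encoded = tokenizer(long_text, add_special_tokens=False, return_tensors="pt")
input_ids = encoded["input_ids"][0]
attention_mask = encoded["attention_mask"][0]

max_len = 510  # leave room for [CLS] and [SEP] in each chunk
chunk_logits = []
for start in range(0, input_ids.size(0), max_len):
    chunk_ids = input_ids[start:start + max_len]
    chunk_mask = attention_mask[start:start + max_len]
    # Re-add special tokens so each chunk is a valid BERT input.
    chunk_ids = torch.cat([
        torch.tensor([tokenizer.cls_token_id]),
        chunk_ids,
        torch.tensor([tokenizer.sep_token_id]),
    ])
    chunk_mask = torch.cat([torch.tensor([1]), chunk_mask, torch.tensor([1])])
    with torch.no_grad():
        out = model(input_ids=chunk_ids.unsqueeze(0),
                    attention_mask=chunk_mask.unsqueeze(0))
    chunk_logits.append(out.logits)

# One simple way to combine per-chunk predictions: average the logits.
combined = torch.stack(chunk_logits).mean(dim=0)
```

As an alternative to chunking manually, Hugging Face tokenizers can produce the windows directly via `return_overflowing_tokens=True` together with a `stride` argument, which also gives overlapping chunks so context is not cut abruptly at window boundaries.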