Properly handle the case when text size is more than the model can handle #164

Open

lalitpagaria opened this issue Jul 9, 2021 · 1 comment

Labels: enhancement (New feature or request)

Comments

@lalitpagaria (Collaborator)

When this project was started, the initial intention was to handle only short text. But now that we have added Google News and crawler sources, there is a need to handle longer text as well.
As we know, most BERT-based models support at most 512 tokens (with a few exceptions such as BigBird). Currently the Analyzer ignores (#113) the excess text.

@akar5h suggested a nicer way to split based on tokenizer output instead of character size.

So would it not be more optimal that, instead of splitting the texts, we split the tokenizer output (the tokenizer.batch_encode_plus output), i.e. input_ids and attention_mask, into sequences of length 512 and feed these splits to the model? A rough sketch of that idea is below.
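
A minimal sketch of this approach, assuming a BERT-style sequence classifier with a 512-token limit; the model name and the mean-of-logits aggregation are illustrative choices, not what Obsei currently does:

```python
# Sketch: encode once without truncation, slice the token ids into 512-length
# chunks, run each chunk through the model, and aggregate the per-chunk logits.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def classify_long_text(text: str, max_len: int = 512) -> torch.Tensor:
    # Tokenize the full text without truncation.
    encoded = tokenizer(text, add_special_tokens=True, truncation=False)
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]

    # Split the tokenizer output (not the raw text) into max_len-sized chunks.
    chunks = [
        (input_ids[i : i + max_len], attention_mask[i : i + max_len])
        for i in range(0, len(input_ids), max_len)
    ]

    logits = []
    for ids, mask in chunks:
        with torch.no_grad():
            out = model(
                input_ids=torch.tensor([ids]),
                attention_mask=torch.tensor([mask]),
            )
        logits.append(out.logits)

    # Simple mean over chunks; other aggregation strategies are possible.
    return torch.cat(logits, dim=0).mean(dim=0)
```

One caveat with naive slicing: chunks after the first do not start with [CLS] or end with [SEP], so in practice the special tokens would need to be re-added per chunk.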

lalitpagaria added the enhancement label on Jul 9, 2021
akar5h added a commit to akar5h/obsei that referenced this issue Jul 10, 2021
lalitpagaria pushed a commit to akar5h/obsei that referenced this issue Jul 26, 2021
@lalitpagaria (Collaborator, Author)

We can see if this can be incorporated here, as suggested by @akar5h:

https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=n9qywopnIrJH&uniqifier=1

Scroll down to "Preprocessing the training data". There we can see how the Hugging Face tokenizer provides a way to split longer documents as a built-in operation; a small sketch of that feature is below.
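
A minimal sketch of that built-in splitting, using `return_overflowing_tokens` with a `stride` so consecutive chunks overlap slightly (as the linked notebook does); the model name, `max_length`, and `stride` values here are illustrative:

```python
# Sketch: let the (fast) tokenizer split an over-long document into
# overlapping 512-token chunks instead of silently truncating it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "some long document text " * 200  # stand-in for a crawled article

encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # emit extra chunks instead of dropping text
    padding="max_length",
)

# One entry per chunk; each chunk can be fed to the model independently.
print(len(encoded["input_ids"]), "chunks of", len(encoded["input_ids"][0]), "tokens")
```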
