Any suggestions to handle longer text? #46

cyriltw · 2022-03-23T16:26:44Z

I'm trying to run predictions with the pre-trained model and I keep running into this issue:

Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1

This happens when I try to predict on a text that is longer than 512 tokens. I understand this is because the string is too long. Other than truncating the string, are there any suggestions on how to deal with this problem with the package?

Thank you

laurahanu · 2022-03-23T18:42:59Z

Hello!
This package is not really designed for long-form text, and the transformer models used (e.g. BERT, RoBERTa) have a max sequence length of 512 tokens. To get around this, one option would be to split your text into chunks, feed those to the model, and then average the results. Would that work for your case?
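
A minimal sketch of the chunk-and-average idea described above. Everything named here is an assumption for illustration, not part of the package: `count_tokens` stands in for whatever reports the token count of a string under your model's tokenizer, `predict` stands in for a callable that returns a dict of label to score for one string, and the default budget of 510 assumes two positions are reserved for special tokens.

```python
from typing import Callable, Dict, List


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],  # hypothetical: token count under your tokenizer
    max_tokens: int = 510,               # assumption: 512 minus two special tokens
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within the budget."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def average_scores(
    chunks: List[str],
    predict: Callable[[str], Dict[str, float]],  # hypothetical: label -> score
) -> Dict[str, float]:
    """Average each label's score over all chunks."""
    if not chunks:
        return {}
    per_chunk = [predict(chunk) for chunk in chunks]
    return {
        label: sum(scores[label] for scores in per_chunk) / len(per_chunk)
        for label in per_chunk[0]
    }
```

Note that a single word whose token count already exceeds the budget still ends up in its own oversized chunk, and text without whitespace between words is not split at all, so both cases would need separate handling.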

cyriltw · 2022-03-23T18:55:01Z

Hi Laura, I see. Thanks for the insights. I was actually thinking about splitting the post and then averaging the results. Just wanted to check if there is a built-in way to handle it.

sorensenjs · 2022-03-23T19:03:25Z

Just a suggestion: taking the max over the splits, perhaps breaking at sentence boundaries, would likely be better than averaging. The model tends to work as a detector, so finding any objectionable content in any part should disqualify the whole document.
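
The max-over-sentences suggestion could be sketched roughly as follows. The regex splitter is a deliberately naive assumption (it breaks after `.`, `!`, or `?` followed by whitespace), and `predict` is again a hypothetical callable returning a dict of label to score for one string.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def max_scores(
    pieces: List[str],
    predict: Callable[[str], Dict[str, float]],  # hypothetical: label -> score
) -> Dict[str, float]:
    """Keep the highest score per label across all pieces (detector-style)."""
    if not pieces:
        return {}
    per_piece = [predict(piece) for piece in pieces]
    return {
        label: max(scores[label] for scores in per_piece)
        for label in per_piece[0]
    }
```

With max aggregation, one flagged sentence is enough to flag the whole document, whereas averaging would dilute it across the clean sentences. A single sentence can still exceed the model limit, so a length check per piece is still needed.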

Jessareid · 2022-05-12T13:41:48Z

Another way to stretch the limit a little would be to remove stopwords before the text is tokenized into a sequence?

cyriltw · 2022-08-15T15:06:52Z

Hi, I'd like to reopen this, as the initial fixes failed when the text passed in is not English. Python's normal length check reports fewer than 512, but the tokenizer counts more than that. This is the case for non-English text such as Mandarin. Do you have any idea what kind of Unicode conversion is happening, or how to filter out such cases before they go to prediction?
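
One likely explanation, offered as a hedged sketch rather than a diagnosis of this package: Python's `len` counts Unicode code points, while the 512 limit is counted in tokenizer tokens, and the two can diverge sharply for non-Latin scripts. Byte-level BPE tokenizers (the kind RoBERTa uses) can emit several tokens for one Chinese character, so a string shorter than 512 characters may still exceed 512 tokens. The example below uses a toy UTF-8 byte tokenizer purely as a stand-in to show the mismatch; `fits_model` and its `reserved` parameter are hypothetical names, and in practice you would pass your model's real tokenizer.

```python
from typing import Callable, List, Sequence


def toy_byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fits_model(
    text: str,
    tokenize: Callable[[str], Sequence],  # plug in your model's tokenizer here
    max_tokens: int = 512,
    reserved: int = 2,                    # assumption: room for special tokens
) -> bool:
    """Check the length in tokens, not characters, before predicting."""
    return len(tokenize(text)) + reserved <= max_tokens


text = "你好世界"
print(len(text))                     # character count seen by Python
print(len(toy_byte_tokenize(text)))  # token count under the toy tokenizer

# With Hugging Face Transformers, the real check would look roughly like this
# (the checkpoint name is whatever your model actually uses):
#   tokenizer = AutoTokenizer.from_pretrained(<checkpoint>)
#   fits_model(text, tokenizer.tokenize)
```

Filtering or chunking on the tokenizer's own count, rather than on character count, avoids the mismatch regardless of language.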
