Any suggestions to handle longer text? #46

Open
cyriltw opened this issue Mar 23, 2022 · 5 comments
Comments

@cyriltw

cyriltw commented Mar 23, 2022

I'm trying to run predictions with the pre-trained model and I keep running into this error:

Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1

This happens whenever I try to predict on a text that is longer than 512 tokens. I understand it is because the string is too long. Other than truncating the string, are there any suggestions on how to deal with this problem within the package?

Thank you

@laurahanu
Collaborator

Hello!
This package is not really designed for long-form text, and the transformer models it uses (e.g. BERT, RoBERTa) have a maximum sequence length of 512 tokens. To get around this, one option would be to split your text into chunks, feed those to the model, and then average the results. Would that work for your case?
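A minimal sketch of the chunk-and-average idea, in case it helps. Everything here is an assumption for illustration: `chunk_tokens`, `average_scores`, and `predict_long` are hypothetical helpers, not part of the package, and `tokenize`, `detokenize`, and `score_chunk` are stand-ins for whatever tokenizer and prediction call you actually use. The `reserved` argument assumes the model adds two special tokens per sequence.

```python
from typing import Callable, Dict, List, Sequence


def chunk_tokens(tokens: Sequence, max_len: int = 512, reserved: int = 2) -> List[list]:
    """Split a token sequence into consecutive windows that fit the model.

    `reserved` leaves room for special tokens (assumed to be two per sequence).
    """
    window = max_len - reserved
    if window <= 0:
        raise ValueError("max_len must be larger than reserved")
    return [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]


def average_scores(per_chunk: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each label's score across all chunks."""
    if not per_chunk:
        return {}
    n = len(per_chunk)
    return {label: sum(s[label] for s in per_chunk) / n for label in per_chunk[0]}


def predict_long(
    text: str,
    tokenize: Callable[[str], list],            # hypothetical: text -> tokens
    detokenize: Callable[[list], str],          # hypothetical: tokens -> text
    score_chunk: Callable[[str], Dict[str, float]],  # hypothetical: model call
    max_len: int = 512,
) -> Dict[str, float]:
    """Score each chunk separately, then average the per-label scores."""
    chunks = chunk_tokens(tokenize(text), max_len)
    return average_scores([score_chunk(detokenize(c)) for c in chunks])
```

With a sequence of 1142 tokens, as in the error above, this yields three chunks of 510, 510, and 122 tokens. Note that a detokenized chunk may not re-tokenize to exactly the same length, so leaving some extra headroom in `reserved` is probably wise.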

@cyriltw
Author

cyriltw commented Mar 23, 2022

Hi Laura, I see. Thanks for the insights. I was actually thinking about splitting the post and then averaging the results. I just wanted to check whether there is a built-in way to handle it.

@sorensenjs

Just a suggestion: taking the max over the splits, perhaps breaking at sentence boundaries, would likely be better than averaging. The model tends to work as a detector, so objectionable content found in any part should disqualify the whole document.
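A rough sketch of the max-over-sentences variant, under stated assumptions: `split_sentences`, `max_scores`, and `detect` are hypothetical helpers, the regex splitter is deliberately naive (it only breaks after `.`, `!`, or `?` followed by whitespace), and `score_chunk` stands in for the real prediction call.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def max_scores(per_chunk: List[Dict[str, float]]) -> Dict[str, float]:
    """Keep the highest score seen for each label across all chunks."""
    if not per_chunk:
        return {}
    return {label: max(s[label] for s in per_chunk) for label in per_chunk[0]}


def detect(text: str, score_chunk: Callable[[str], Dict[str, float]]) -> Dict[str, float]:
    """Score every sentence on its own and report the worst case per label."""
    return max_scores([score_chunk(s) for s in split_sentences(text)])
```

Averaging dilutes a single bad sentence across the whole document, while the max keeps it visible. A single sentence can still exceed the 512-token limit, so very long sentences would need further splitting.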

@Jessareid

Another way to stretch the limit a little would be to apply stopword removal before the text is tokenized into a sequence?

@cyriltw
Author

cyriltw commented Aug 15, 2022

Hi, I'd like to open this up again, as the initial fixes fail when the text passed in is not English. Python's normal length check reports fewer than 512, but the tokenizer counts more than that. This happens for non-English text such as Mandarin. Do you have any idea what kind of Unicode conversion is happening, or how to filter out such cases before they go to prediction?
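One likely explanation, offered as an assumption rather than a confirmed diagnosis: `len(text)` counts Unicode characters, while the 512 limit is measured in subword tokens, and non-Latin scripts often produce more tokens than characters. The sketch below only illustrates the mismatch and a safer length check. `length_report` and `fits_model` are hypothetical helpers, and `count_tokens` is a stand-in for the real tokenizer.

```python
from typing import Callable, Dict


def length_report(text: str) -> Dict[str, int]:
    """Compare the character count with the UTF-8 byte count."""
    return {"characters": len(text), "utf8_bytes": len(text.encode("utf-8"))}


def fits_model(text: str, count_tokens: Callable[[str], int], max_len: int = 512) -> bool:
    """Check the length in tokens, not characters.

    `count_tokens` is a hypothetical callable. With Hugging Face transformers,
    one option (assuming you have access to the model's tokenizer) would be
    lambda t: len(tokenizer(t)["input_ids"]).
    """
    return count_tokens(text) <= max_len


def utf8_byte_count(text: str) -> int:
    """Stand-in counter: a pessimistic bound for byte-level tokenizers,
    which cannot emit more tokens than there are bytes (special tokens aside)."""
    return len(text.encode("utf-8"))
```

For example, `length_report("你好世界")` gives 4 characters but 12 UTF-8 bytes, so a 300-character Mandarin text can pass a `len(text) < 512` check and still exceed the limit once tokenized. Measuring (or truncating) with the model's own tokenizer avoids the mismatch.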
