
NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_bert_base) #1686

Open
ghnp5 opened this issue Apr 21, 2024 · 1 comment

ghnp5 commented Apr 21, 2024

DeepPavlov version:
The latest docker container deeppavlov/deeppavlov, published last month

Python version:
3.10

Operating system:
The latest docker container deeppavlov/deeppavlov, published last month.
Docker is running on CentOS/AlmaLinux.

Issue:

I’m looking to understand how to prevent this crash from happening.

input sequence after bert tokenization shouldn’t exceed 512 tokens.

I’m using the REST API, so I’m calling ner_bert_base like this:

{
  "x": [
    "A huge text. Blah blah blah... No line breaks. I'm a 28 year-old person called John Smith, etc..."
  ]
}
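For completeness, the request body above can be assembled programmatically on the caller side. A minimal Python sketch; the URL and port are assumptions based on DeepPavlov's usual riseapi defaults, so verify them against your container's configuration:

```python
import json

# Sketch of building the ner_bert_base REST request body on the caller side.
# The endpoint URL is an assumption: DeepPavlov's riseapi commonly serves
# POST /model on port 5000 -- check your own container's configuration.
URL = "http://localhost:5000/model"

payload = {"x": ["A huge text. Blah blah blah... No line breaks."]}
body = json.dumps(payload)

# To actually send it (requires the `requests` package and a running server):
# import requests
# response = requests.post(URL, data=body,
#                          headers={"Content-Type": "application/json"})
```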

While researching this error, I found: #839 (comment)

which says:

Sorry, but the BERT model has positional embeddings only for first 512 subtokens. So, the model can’t work with longer sequences. It is a deliberate architecture restriction. Subtokens are produced by WordPiece tokenizer (BPE). 512 subtokens correspond approximately to 300-350 regular tokens for multilingual model. Make sure that you performed sentence tokenization before dumping the data. Every sentence in the dumped data should be separated by an empty line.
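In practice, "sentence tokenization" here just means splitting the long text into sentences on the caller side and sending each sentence as its own element of "x", so that no single element approaches the 512-subtoken limit. A minimal sketch; the naive regex splitter below is an assumption for illustration, not a DeepPavlov API, and a real application may prefer a proper sentence tokenizer such as NLTK's:

```python
import re

def split_sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace.
    This is a rough heuristic; a dedicated sentence tokenizer handles
    abbreviations, quotes, etc. far better."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

long_text = ("I'm a 28 year-old person called John Smith. "
             "This is another sentence. And one more!")

# Each sentence becomes one element of "x" in the REST request,
# instead of one huge string.
payload = {"x": split_sentences(long_text)}
```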

But I don’t fully understand what I need to do to resolve the problem.

What does “Make sure that you performed sentence tokenization before dumping the data” mean? Is it some function I need to call first, that returns the list of tokens? Is it something that I can call with the REST API from my application/code?

I was also looking into having my application (the caller) somehow tokenize the words and punctuation itself, and then send only the first 512, but it's hard to preserve the spacing, and even when I send 512 tokens, the model somehow still exceeds the limit and crashes anyway.
I feel like I'm trying to reinvent the wheel.
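One caller-side workaround, based on the "512 subtokens correspond approximately to 300-350 regular tokens" figure quoted above, is to avoid counting subtokens exactly and instead chunk on whitespace with a deliberately conservative word budget. A rough sketch; the 250-word budget is an assumed safety margin, not a documented limit, and note that split/join collapses whitespace (the spacing issue mentioned above):

```python
def chunk_words(text, max_words=250):
    """Split text into whitespace-delimited chunks of at most max_words words.
    250 is a conservative budget: per the comment quoted above, 512 BERT
    subtokens correspond to roughly 300-350 regular tokens, so staying well
    under that should keep every chunk inside the model's limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Each chunk becomes one element of the "x" list in the REST request.
payload = {"x": chunk_words("word " * 600)}
```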

Couldn't the API and/or the model simply truncate input past 512 tokens, either silently or via a flag/parameter in the request?

(Note that my application is not written in Python.)

Thank you very much!

@ghnp5 ghnp5 added the bug label Apr 21, 2024
@ghnp5 ghnp5 changed the title NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_conll2003_bert) NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_bert_base) Apr 23, 2024

ghnp5 commented Apr 24, 2024
