
NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_bert_base) #1686

Open
ghnp5 opened this issue Apr 21, 2024 · 1 comment

ghnp5 commented Apr 21, 2024

DeepPavlov version:
The latest docker container deeppavlov/deeppavlov, published last month

Python version:
3.10

Operating system:
The latest docker container deeppavlov/deeppavlov, published last month.
Docker is running on CentOS/AlmaLinux.

Issue:

I’m looking to understand how to prevent this crash from happening.

input sequence after bert tokenization shouldn’t exceed 512 tokens.

I’m using the REST API, so I’m calling ner_bert_base like this:

{
  "x": [
    "A huge text. Blah blah blah... No line breaks. I'm a 28 year-old person called John Smith, etc..."
  ]
}
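For completeness, the request body above can be assembled programmatically on the caller side. A minimal Python sketch; the URL and port are assumptions based on DeepPavlov's usual riseapi defaults, so verify them against your container's configuration:

```python
import json

# Sketch of building the ner_bert_base REST request body on the caller side.
# The endpoint URL is an assumption: DeepPavlov's riseapi commonly serves
# POST /model on port 5000 -- check your own container's configuration.
URL = "http://localhost:5000/model"

payload = {"x": ["A huge text. Blah blah blah... No line breaks."]}
body = json.dumps(payload)

# To actually send it (requires the `requests` package and a running server):
# import requests
# response = requests.post(URL, data=body,
#                          headers={"Content-Type": "application/json"})
```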

While researching this error, I found: #839 (comment)

which says:

Sorry, but the BERT model has positional embeddings only for first 512 subtokens. So, the model can’t work with longer sequences. It is a deliberate architecture restriction. Subtokens are produced by WordPiece tokenizer (BPE). 512 subtokens correspond approximately to 300-350 regular tokens for multilingual model. Make sure that you performed sentence tokenization before dumping the data. Every sentence in the dumped data should be separated by an empty line.
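In practice, "sentence tokenization" here just means splitting the long text into sentences on the caller side and sending each sentence as its own element of "x", so that no single element approaches the 512-subtoken limit. A minimal sketch; the naive regex splitter below is an assumption for illustration, not a DeepPavlov API, and a real application may prefer a proper sentence tokenizer such as NLTK's:

```python
import re

def split_sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace.
    This is a rough heuristic; a dedicated sentence tokenizer handles
    abbreviations, quotes, etc. far better."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

long_text = ("I'm a 28 year-old person called John Smith. "
             "This is another sentence. And one more!")

# Each sentence becomes one element of "x" in the REST request,
# instead of one huge string.
payload = {"x": split_sentences(long_text)}
```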

But I don’t fully understand what I need to do to resolve the problem.

What does “Make sure that you performed sentence tokenization before dumping the data” mean? Is it some function I need to call first, that returns the list of tokens? Is it something that I can call with the REST API from my application/code?

I was also looking into having my application (the caller) somehow tokenize the words and punctuation itself, and then send only the first 512, but it's hard to preserve the spacing, and even when I send 512 tokens, the model somehow still exceeds the limit and crashes anyway.
I feel like I'm trying to reinvent the wheel.
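One caller-side workaround, based on the "512 subtokens correspond approximately to 300-350 regular tokens" figure quoted above, is to avoid counting subtokens exactly and instead chunk on whitespace with a deliberately conservative word budget. A rough sketch; the 250-word budget is an assumed safety margin, not a documented limit, and note that split/join collapses whitespace (the spacing issue mentioned above):

```python
def chunk_words(text, max_words=250):
    """Split text into whitespace-delimited chunks of at most max_words words.
    250 is a conservative budget: per the comment quoted above, 512 BERT
    subtokens correspond to roughly 300-350 regular tokens, so staying well
    under that should keep every chunk inside the model's limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Each chunk becomes one element of the "x" list in the REST request.
payload = {"x": chunk_words("word " * 600)}
```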

Couldn't the API and/or the model simply truncate input past 512 tokens, either silently or via a flag/parameter in the request?

(Note that my application is not written in Python.)

Thank you very much!

@ghnp5 ghnp5 added the bug label Apr 21, 2024
@ghnp5 ghnp5 changed the title NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_conll2003_bert) NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_bert_base) Apr 23, 2024

ghnp5 commented Apr 24, 2024
