Warning tells you you will get indexing errors in T5 for going beyond max length #16986

marksverdhei · 2022-04-28T10:18:23Z

System Info

- `transformers` version: 4.16.2
- Python version: 3.8.12

Who can help?

@patrickvonplaten @saul

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

To reproduce:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer("foo " * 2000, return_tensors="pt")
Outputs `Token indices sequence length is longer than the specified maximum sequence length for this model (4001 > 512). Running this sequence through the model will result in indexing errors`

>>> from transformers import AutoModelForSeq2SeqLM
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> model.generate(**inputs)
tensor([[   0, 5575,   32, 5575,   32, 5575,   32, 5575,   32, 5575,   32, 5575,
           32, 5575,   32, 5575,   32, 5575,   32, 5575]])

No indexing errors

Expected behavior

The warning is wrong for T5, since it uses relative positional embeddings.
I would expect no warning or, failing that, a warning about memory usage.

I suppose this issue applies to all models that do not have fixed-length positional encodings.


patrickvonplaten · 2022-04-28T13:15:44Z

Thanks a lot for the issue @marksverdhei. You're right, T5 has no fixed max length, so this warning is confusing.

The reason many people associate T5 with a max length of 512 is that it was pretrained with a max length of 512, but it is not limited to this length!

It has been shown to generalize well to longer sequences. Also see: #5204

Majdoddin · 2023-04-25T08:47:51Z

I think it is a bit confusing. As in the paper, "We use a maximum sequence length of 512". Note that this is the number of tokens, not words. I guess this corresponds to the max_input_length = 512 parameter. This is the maximum number of tokens that the underlying model can take. You cannot change it.

But for longer text, you can write a script to break it into 512-token chunks and feed them to the model. And I guess that is where max_source_length (length of text) is relevant.
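
The chunking idea described above could be sketched roughly as follows. This is only an illustration: `chunk_token_ids` is a hypothetical helper, not part of `transformers`, and it works on a plain list of token ids without handling special tokens such as the end-of-sequence token.

```python
def chunk_token_ids(token_ids, chunk_size=512):
    """Split a list of token ids into consecutive chunks of at most chunk_size.

    Hypothetical helper for illustration; it does not add special tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[start:start + chunk_size]
        for start in range(0, len(token_ids), chunk_size)
    ]


# Stand-in for the 4001 token ids reported in the warning above.
token_ids = list(range(4001))
chunks = chunk_token_ids(token_ids)
print(len(chunks), len(chunks[-1]))
```

Each chunk would then be fed to the model separately, at the cost of losing context across chunk boundaries.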

marksverdhei · 2023-04-25T08:57:40Z

> I think it is a bit confusing. As in the paper, "We use a maximum sequence length of 512". Note that this is the number of tokens, not words. I guess this corresponds to the max_input_length = 512 parameter. This is the maximum number of tokens that the underlying model can take. You cannot change it.
>
> But for longer text, you can write a script to break it into 512-token chunks and feed them to the model. And I guess that is where max_source_length (length of text) is relevant.

With T5 you can change the max input length. T5 uses relative positional embeddings, which make it possible to process sequences of arbitrary length, as opposed to classical absolute positional embeddings such as those in the original transformer architecture.
A length of 512 tokens is used during training only because it is a trade-off between processing long-enough texts and not using too much time and memory.
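
To make the point about relative positions concrete, here is a minimal pure-Python sketch of T5-style relative position bucketing. It is a simplified scalar version written for illustration, not the library's implementation, and the defaults of 32 buckets and a max distance of 128 are assumed values. The key property is that every query-key offset, however large, maps to one of a fixed number of buckets, so no position index can run past the end of an embedding table.

```python
import math


def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map a signed key-minus-query offset to a bucket in [0, num_buckets).

    Simplified bidirectional sketch; parameter defaults are assumptions.
    """
    half = num_buckets // 2
    # One half of the buckets is for positive offsets, the other for the rest.
    bucket = half if relative_position > 0 else 0
    distance = abs(relative_position)
    max_exact = half // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance
    # Larger distances share logarithmically sized buckets, and everything
    # at or beyond max_distance falls into the last bucket.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    return bucket + min(large, half - 1)


# Offsets far beyond 512 still land inside the fixed bucket range.
buckets = {relative_position_bucket(offset) for offset in range(-4000, 4001)}
print(sorted(buckets))
```

Because the bucket index is bounded regardless of sequence length, longer inputs cost more time and memory but do not cause indexing errors.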

marksverdhei added the bug label Apr 28, 2022

patrickvonplaten mentioned this issue Apr 28, 2022

[T5 Tokenizer] Model has no fixed position ids - there is no hardcode… #16990

Merged

5 tasks

patrickvonplaten closed this as completed in #16990 May 2, 2022
