
Asymmetric search models with longer max seq length? #23

Open
regstuff opened this issue Mar 28, 2023 · 7 comments


@regstuff

Hi,
Great work in this repo. I've been trying to use it in my asymmetric search application, which is essentially a document retrieval system.
Currently I use one of the sentence-transformer models trained by UKPLab, with a max sequence length of 512 tokens, but most of my documents are quite a bit longer.
I was wondering whether any of the SGPT models that you or anyone else have trained support a longer max length? Most of what I see on Hugging Face has a max length of 75 or 300.

Thanks

@Muennighoff
Owner

For most models you can significantly increase the sequence length. If you load via SentenceTransformer you can do the following after loading the model:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")  # e.g. the 5.8B model
# Change the maximum sequence length to 2048
model.max_seq_length = 2048

The maximum sequence length for most SGPT models on the hub is 2048; you can always check the config (https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/2dbba11efed19bb418811eac04be241ddc42eb99/config.json#L19).
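
For example, a quick programmatic check (a sketch assuming the transformers library is installed; GPT-J-based SGPT models expose the limit as n_positions, while other architectures use max_position_embeddings):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")
# Print whichever position-limit field the config defines
print(getattr(config, "n_positions", None) or getattr(config, "max_position_embeddings", None))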

Note though that the models weren't trained / evaluated on examples that long, so I'm not sure how well they perform. Would be interesting to hear about your experience!

@regstuff
Author

Thanks for the reply.
That was my main concern: though the models are capable of longer sequence lengths, they weren't trained for them. I was hoping you might have some experience on whether performance degrades, but I guess the ball is in my court now! :)
Another question I had was whether the SGPT models can be loaded with the bitsandbytes package to reduce GPU memory usage. I only have a 16 GB VRAM GPU handy, and the 5.8B models may be too large for that.

@Muennighoff
Owner

Yeah, the reason the sequence length is set to 300 during training is that it saves a lot of memory, and in many cases 300 tokens are enough to determine similarity even if the actual texts are longer.

I think it can, but I haven't tried it either - I pasted some code in this issue that might work: #19 (comment)
Let me know if it works for you!
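
For reference, a minimal sketch of what 8-bit loading could look like with transformers and bitsandbytes (the arguments here are assumptions that depend on your library versions, and this hasn't been verified with SGPT):

# Requires bitsandbytes and accelerate to be installed
from transformers import AutoModel, AutoTokenizer

model_name = "Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the weights to 8-bit, roughly halving memory vs fp16
model = AutoModel.from_pretrained(model_name, device_map="auto", load_in_8bit=True)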

@lpasselin

@regstuff did you get good results when bumping the sequence length?

@regstuff
Author


> @regstuff did you get good results when bumping the sequence length?

Frankly, I haven't been able to figure out a sensible way of measuring the quality. Any ideas welcome.

@r100-stack

To confirm, is there a difference between sequence length and token length? Or do they mean the same?

@Muennighoff
Owner

> To confirm, is there a difference between sequence length and token length? Or do they mean the same?

It's the same, i.e. sequence length is measured in tokens.
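
To illustrate, a small sketch of counting the tokens a text occupies, assuming the transformers library and the tokenizer of one of the SGPT models:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")
text = "Sequence length is measured in tokens, not characters or words."
# Number of tokens this text counts against max_seq_length
print(len(tokenizer(text)["input_ids"]))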
