
Asymmetric search models with longer max seq length? #23

Open
regstuff opened this issue Mar 28, 2023 · 7 comments


@regstuff

Hi,
Great work in this repo. I've been trying to use it in my asymmetric search application, which is essentially a document retrieval system.
Currently I use one of the sentence-transformer models trained by UKPLab, with a max sequence length of 512 tokens, but most of my documents are quite a bit longer.
I was wondering whether any of the SGPT models that you or anyone else have trained support a longer max length? Most of what I see on Hugging Face has a max length of 75 or 300.

Thanks

@Muennighoff
Owner

For most models you can significantly increase the sequence length. If you load via SentenceTransformer you can do the following after loading the model:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")  # e.g. the 5.8B model
# Change the maximum sequence length to 2048
model.max_seq_length = 2048

The maximum sequence length for most SGPT models on the hub is 2048; you can always check the config (https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/2dbba11efed19bb418811eac04be241ddc42eb99/config.json#L19).
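
For example, a quick programmatic check (a sketch assuming the transformers library is installed; GPT-J-based SGPT models expose the limit as n_positions, while other architectures use max_position_embeddings):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")
# Print whichever position-limit field the config defines
print(getattr(config, "n_positions", None) or getattr(config, "max_position_embeddings", None))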

Note though that the models weren't trained / evaluated on examples that long, so I'm not sure how well they perform. Would be interesting to hear about your experience!

@regstuff
Author

Thanks for the reply.
That was my main concern: though the models are capable of longer sequence lengths, they weren't trained for them. I was hoping you might have some experience on whether performance degrades, but I guess the ball is in my court now! :)
Another question I had was whether the SGPT models can be loaded with the bitsandbytes package to reduce GPU memory usage. I only have a 16 GB VRAM GPU handy, and the 5.8B models may be too large for that.

@Muennighoff
Owner

Yeah, the reason the sequence length is set to 300 during training is that it saves a lot of memory, and in many cases 300 tokens are enough to determine similarity even if the actual texts are longer.

I think it can, but I haven't tried it either - I pasted some code in this issue that might work: #19 (comment)
Let me know if it works for you!
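
For reference, a minimal sketch of what 8-bit loading could look like with transformers and bitsandbytes (the arguments here are assumptions that depend on your library versions, and this hasn't been verified with SGPT):

# Requires bitsandbytes and accelerate to be installed
from transformers import AutoModel, AutoTokenizer

model_name = "Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the weights to 8-bit, roughly halving memory vs fp16
model = AutoModel.from_pretrained(model_name, device_map="auto", load_in_8bit=True)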

@lpasselin

@regstuff did you get good results when bumping the sequence length?

@regstuff
Author


> @regstuff did you get good results when bumping the sequence length?

Frankly, I haven't been able to figure out a sensible way of measuring the quality. Any ideas welcome.

@r100-stack

To confirm, is there a difference between sequence length and token length? Or do they mean the same?

@Muennighoff
Owner

> To confirm, is there a difference between sequence length and token length? Or do they mean the same?

It's the same, i.e. sequence length is measured in tokens.
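
To illustrate, a small sketch of counting the tokens a text occupies, assuming the transformers library and the tokenizer of one of the SGPT models:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit")
text = "Sequence length is measured in tokens, not characters or words."
# Number of tokens this text counts against max_seq_length
print(len(tokenizer(text)["input_ids"]))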
