Why use low chunksizes? #39

Open
aksj98 opened this issue Jul 10, 2023 · 3 comments


aksj98 (Contributor) commented Jul 10, 2023

Hi!

I saw that you used low chunk sizes (2-4) when training the models; may I know why? I'd think a GPU with 40GB of RAM can handle more. Does a low chunk size give better empirical results?

Thanks!

Muennighoff (Owner) commented

The chunk size does not affect empirical results. Use the highest one that works for you! The higher it is, the faster the training.

A few other factors affect GPU RAM as well, like model size and sequence length; I think I was bottlenecked by one of them and hence had to go very low in chunk size.
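
For intuition, here is a minimal sketch of how chunking trades speed for memory, assuming a GradCache-style setup (the helper, the toy self-similarity loss, and all names are illustrative, not the repo's actual code): the batch is split into chunks, embeddings are cached without autograd, and gradients then flow through the encoder one chunk at a time, so peak activation memory scales with the chunk size rather than the batch size.

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, batch, chunk_size, optimizer):
    """One training step where only `chunk_size` examples are ever
    forwarded with autograd at a time (hypothetical helper)."""
    chunks = list(batch.split(chunk_size))

    # Pass 1: embed all chunks without building graphs,
    # so activation memory scales with chunk_size.
    with torch.no_grad():
        embs = torch.cat([encoder(c) for c in chunks])

    # Contrastive loss over the *full* batch of cached embeddings;
    # backward here only produces gradients w.r.t. the embeddings.
    embs = embs.detach().requires_grad_()
    sims = F.normalize(embs, dim=-1) @ F.normalize(embs, dim=-1).T
    labels = torch.arange(len(embs), device=embs.device)
    loss = F.cross_entropy(sims / 0.05, labels)  # toy in-batch loss
    loss.backward()
    emb_grads = embs.grad.split(chunk_size)

    # Pass 2: re-encode chunk by chunk with autograd on and push the
    # cached embedding gradients through the encoder.
    for c, g in zip(chunks, emb_grads):
        encoder(c).backward(gradient=g)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

A larger chunk size just means fewer re-encoding passes in the loop above, hence faster training, while the loss itself (and so the results) is computed over the full batch either way.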

aksj98 (Contributor, Author) commented Jul 10, 2023

Thanks Niklas! I had a quick question as well: I see you used a bunch of different LRs. Which LR did you find to be the best? Did you also schedule the LRs in any way?

Muennighoff (Owner) commented

I didn't experiment extensively with the LRs; I think it's based on the SentenceTransformers defaults.
I found that scaling the LR with the batch size works best. E.g. for bs=1024 I used 32e-5, so if your bs=512, I'd try 16e-5; if you go for 2048, I'd try 64e-5, etc. Still, searching over 2-3 values may be best.
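
That linear scaling rule is easy to apply directly; a tiny sketch (the helper name is made up, the numbers are the ones from the comment above):

```python
def scaled_lr(batch_size, base_lr=32e-5, base_bs=1024):
    """Scale the learning rate linearly with batch size (illustrative helper)."""
    return base_lr * batch_size / base_bs

print(scaled_lr(512))   # 0.00016 -> the suggested 16e-5
print(scaled_lr(1024))  # 0.00032 -> the 32e-5 baseline
print(scaled_lr(2048))  # 0.00064 -> the suggested 64e-5
```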

It automatically uses a WarmupLinear schedule (the sentence-transformers default); see the training code.
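
For reference, a hedged sketch of where that schedule comes from, assuming a standard sentence-transformers training loop rather than the repo's exact script: model.fit() defaults to scheduler="WarmupLinear", which ramps the LR up over warmup_steps and then decays it linearly to zero. The model name, toy data, and warmup value below are illustrative.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("bert-base-uncased")  # illustrative base model
train_examples = [  # toy pairs; real training uses the repo's datasets
    InputExample(texts=["how to bake bread", "a simple bread recipe"]),
    InputExample(texts=["capital of France", "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",        # sentence-transformers' default schedule
    warmup_steps=100,                # illustrative; often ~10% of total steps
    optimizer_params={"lr": 32e-5},  # the bs=1024 LR from the discussion
)
```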
