Why use low chunksizes? #39

Open
aksj98 opened this issue Jul 10, 2023 · 3 comments


aksj98 (Contributor) commented Jul 10, 2023

Hi!

I saw that you used low chunk sizes (2-4) when training the models; may I know why? I'd think a GPU with 40GB of RAM can handle more. Does a low chunk size give better empirical results?

Thanks!

Muennighoff (Owner) commented

The chunk size does not affect empirical results. Use the highest one that works for you! The higher it is, the faster the training.

A few other factors affect GPU RAM as well, like model size and sequence length; I think I was bottlenecked by one of them and hence had to go very low in chunk size.
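
For intuition, here is a minimal sketch of how chunking trades speed for memory, assuming a GradCache-style setup (the helper, the toy self-similarity loss, and all names are illustrative, not the repo's actual code): the batch is split into chunks, embeddings are cached without autograd, and gradients then flow through the encoder one chunk at a time, so peak activation memory scales with the chunk size rather than the batch size.

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, batch, chunk_size, optimizer):
    """One training step where only `chunk_size` examples are ever
    forwarded with autograd at a time (hypothetical helper)."""
    chunks = list(batch.split(chunk_size))

    # Pass 1: embed all chunks without building graphs,
    # so activation memory scales with chunk_size.
    with torch.no_grad():
        embs = torch.cat([encoder(c) for c in chunks])

    # Contrastive loss over the *full* batch of cached embeddings;
    # backward here only produces gradients w.r.t. the embeddings.
    embs = embs.detach().requires_grad_()
    sims = F.normalize(embs, dim=-1) @ F.normalize(embs, dim=-1).T
    labels = torch.arange(len(embs), device=embs.device)
    loss = F.cross_entropy(sims / 0.05, labels)  # toy in-batch loss
    loss.backward()
    emb_grads = embs.grad.split(chunk_size)

    # Pass 2: re-encode chunk by chunk with autograd on and push the
    # cached embedding gradients through the encoder.
    for c, g in zip(chunks, emb_grads):
        encoder(c).backward(gradient=g)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

A larger chunk size just means fewer re-encoding passes in the loop above, hence faster training, while the loss itself (and so the results) is computed over the full batch either way.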

aksj98 (Contributor, Author) commented Jul 10, 2023

Thanks Niklas! I had a quick question as well: I see you used a bunch of different LRs. Which LR did you find to be the best? Did you also schedule the LRs in any way?

Muennighoff (Owner) commented

I didn't experiment extensively with the LRs; I think it's based on the SentenceTransformers defaults.
I found that scaling the LR with the batch size works best. E.g. for bs=1024 I used 32e-5, so if your bs=512, I'd try 16e-5; if you go for 2048, I'd try 64e-5, etc. Still, searching over 2-3 values may be best.
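
That linear scaling rule is easy to apply directly; a tiny sketch (the helper name is made up, the numbers are the ones from the comment above):

```python
def scaled_lr(batch_size, base_lr=32e-5, base_bs=1024):
    """Scale the learning rate linearly with batch size (illustrative helper)."""
    return base_lr * batch_size / base_bs

print(scaled_lr(512))   # 0.00016 -> the suggested 16e-5
print(scaled_lr(1024))  # 0.00032 -> the 32e-5 baseline
print(scaled_lr(2048))  # 0.00064 -> the suggested 64e-5
```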

It automatically uses a WarmupLinear schedule (the sentence-transformers default); see the training code.
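
For reference, a hedged sketch of where that schedule comes from, assuming a standard sentence-transformers training loop rather than the repo's exact script: model.fit() defaults to scheduler="WarmupLinear", which ramps the LR up over warmup_steps and then decays it linearly to zero. The model name, toy data, and warmup value below are illustrative.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("bert-base-uncased")  # illustrative base model
train_examples = [  # toy pairs; real training uses the repo's datasets
    InputExample(texts=["how to bake bread", "a simple bread recipe"]),
    InputExample(texts=["capital of France", "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",        # sentence-transformers' default schedule
    warmup_steps=100,                # illustrative; often ~10% of total steps
    optimizer_params={"lr": 32e-5},  # the bs=1024 LR from the discussion
)
```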
