
What H/W do you need to fine-tune PolyCoder? #36

Open
smith-co opened this issue Nov 17, 2022 · 1 comment

Comments

@smith-co

I would like to fine-tune the PolyCoder model.

What are the H/W requirements to fine-tune the PolyCoder model?

What are the GPU requirements?

@VHellendoorn
Owner

Assuming you're asking about the largest model (~2.7B non-embedding parameters), you should definitely aim for multiple, relatively large GPUs. While the model weights themselves only take up ~6GB in half-precision, adding the complete optimizer states puts the memory footprint closer to 40GB, which makes it nearly impossible to train on any single GPU (except maybe the 80GB edition of the A100).
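For a rough sense of where those numbers come from, here's a back-of-the-envelope estimate (a sketch assuming standard mixed-precision Adam bookkeeping; the exact footprint also depends on activations, sequence length, and the framework):

```python
# Rough memory estimate for fine-tuning a ~2.7B-parameter model with Adam.
# Illustrative numbers only; actual usage varies with the framework.

params = 2.7e9

fp16_weights = params * 2   # half-precision weights
fp32_master  = params * 4   # fp32 master copy kept by mixed-precision optimizers
adam_m       = params * 4   # Adam first moment (fp32)
adam_v       = params * 4   # Adam second moment (fp32)

total_bytes = fp16_weights + fp32_master + adam_m + adam_v
print(f"weights only:         {fp16_weights / 1e9:.1f} GB")  # ~5.4 GB
print(f"with optimizer state: {total_bytes / 1e9:.1f} GB")   # ~38 GB
```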

In multi-GPU settings, most of this memory would be spread across the available cards, so you just need enough space left to fit reasonable batch sizes. I can imagine you could train this on a 2-GPU machine if both GPUs have 40GB+ of RAM (which includes GPUs like the A100, A6000/8000 or RTX 8000), but you'd probably be able to fit very few sequences (like, 1 or 2) at a time. When training these models, the total batch size is quite important. If you fine-tune with sequences of 2K tokens, as we did, you'll want to make sure each batch contains at least 128 (but preferably 256+) such sequences to keep training stable enough. When your devices can't handle that big a batch in one go, they can instead use gradient accumulation to run through many sequences before each gradient update. That does mean each batch can take a long time, which is why it's better to have 4 or 8 GPUs available.
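To illustrate the gradient-accumulation idea (this is not PolyCoder's actual training code, which is handled by the GPT-NeoX/DeepSpeed toolkit; the model, data, and step counts below are toy placeholders), a generic PyTorch-style loop looks roughly like this:

```python
import torch
from torch import nn

model = nn.Linear(16, 1)                       # stand-in for a large language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

accumulation_steps = 64                        # micro-batches per optimizer update
micro_batch = 2                                # sequences that fit on the device at once

optimizer.zero_grad()
for step in range(256):
    x = torch.randn(micro_batch, 16)           # stand-in for a micro-batch of token sequences
    loss = model(x).pow(2).mean() / accumulation_steps  # scale so gradients average over the full batch
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # one "real" update per 64 micro-batches
        optimizer.zero_grad()
```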

How much capacity you need will depend on how much data you have and how long you want to fine-tune for. E.g., if you have 1B new tokens and want to fine-tune with batches of 256K tokens each (128 sequences), you'll need to train for ~4K steps, which isn't very long. A few days on two A6000s with a very small micro-batch size, e.g. 1 sequence per device (and a correspondingly large number of gradient accumulation steps, in this case 128 / 2 = 64), should suffice, provided they have enough memory. If you have way more data, you'll want to use bigger hardware. As an anecdotal example: in the second phase of training (going from 100K to 150K steps), PolyCoder-2.7B was trained on 4 RTX 8000 (48GB) GPUs with gradient accumulation. This took about 3 weeks, if I recall correctly, for 50K steps, so ~2K steps per day on 4 GPUs.
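Spelling out that arithmetic (all numbers here are the example values above, not PolyCoder requirements):

```python
# Back-of-the-envelope training-length estimate for the example above.

new_tokens       = 1_000_000_000  # size of the fine-tuning corpus
sequence_length  = 2_048          # tokens per sequence
batch_sequences  = 128            # sequences per effective batch
num_gpus         = 2
micro_batch_size = 1              # sequences each GPU fits at once

tokens_per_batch = sequence_length * batch_sequences                 # ~262K tokens
num_steps        = new_tokens // tokens_per_batch                    # ~3,800 steps (~4K)
grad_accum_steps = batch_sequences // (num_gpus * micro_batch_size)  # 64

print(f"steps: {num_steps}, gradient accumulation steps: {grad_accum_steps}")
```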

Fine-tuning on smaller GPUs, like RTX Titans with 24GB of RAM, might be difficult. I'm not sure whether the toolkit will natively split the optimizer states into small enough chunks to make this work, even with 4-8 of those.
