why customized optimizers? #566
Comments
@joellliu, OLMo's AdamW optimizer is simply a code-organization decision: it groups PyTorch's AdamW optimizer, gradient clipping, and metrics collection into a single Python module. To summarize, our optimizer is not custom in the sense that it's still PyTorch's AdamW; it just keeps the additional functionality in one place, which lets us experiment more easily. Does this answer your question?
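For readers wondering what such a grouping looks like, here is a minimal sketch of this kind of wrapper. The class and method names are hypothetical illustrations, not OLMo's actual code; the update rule is still plain PyTorch AdamW:

```python
import torch
from torch.optim import AdamW


class AdamWWithMetrics(AdamW):
    """Plain PyTorch AdamW, plus gradient clipping and metrics collection
    in one place (a hypothetical sketch, not OLMo's actual class)."""

    def clip_grads_and_collect_metrics(self, max_grad_norm: float) -> dict:
        # Gather all parameters this optimizer is responsible for.
        params = [p for group in self.param_groups for p in group["params"]]
        # Standard PyTorch gradient clipping; returns the pre-clip total norm.
        total_norm = torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
        # Keep the metric as a tensor so logging it later doesn't force a
        # host-device sync on every step.
        return {"grad/total_norm": total_norm}
```

A training loop would call `clip_grads_and_collect_metrics(...)` between `loss.backward()` and `optimizer.step()`; the optimizer update itself is unchanged AdamW.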
@dumitrac Thank you so much for the reply! That answers my question. I am also curious whether you have tried or implemented any techniques to increase throughput in multi-node distributed training. We tried both OLMo and TinyLlama on our cluster, and the per-GPU throughput of TinyLlama drops a lot as we increase the number of nodes, while OLMo stays roughly the same. So I am curious whether you have done any optimization specifically for multi-node training. Thanks!
@joellliu - we mainly rely on FSDP for this.
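FSDP here refers to PyTorch's FullyShardedDataParallel. As a rough illustration (generic PyTorch code, not OLMo's training script, and `MyTransformer` is a hypothetical model class), wrapping a model looks like this:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes the job was launched with torchrun so rank/world-size env vars exist.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = MyTransformer().cuda()   # hypothetical model class
model = FSDP(model)              # shards parameters, gradients, and optimizer state across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```

Because each rank only holds a shard of the parameters and optimizer state, memory and communication scale more gracefully as nodes are added.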
@dumitrac Thanks for the reply! Can you elaborate on how you avoid host-device syncs and other stalls? Thank you!
@joellliu - the goal is to keep the GPU busy at all times (or as much as possible). Basically, the way training works on GPUs is that the "host", i.e., the Python process, issues a bunch of instructions to the GPU (the "device"): multiply this by that, calculate a mean, run some activation function, multiply the result by some other result, and so on, one after the other. These instructions go into a queue, and the GPU works through that queue doing all these tasks. Meanwhile, the host can keep running and do other stuff: issue more instructions, write log messages, read files, whatever. So the device can always be working, as long as the host can keep up issuing instructions.
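As an illustration of where this breaks down (generic PyTorch, not OLMo's code): reading a GPU value back into Python forces the host to stop issuing work until the device has drained its queue, which is exactly the kind of stall to avoid doing every step.

```python
import torch

device = torch.device("cuda")
x = torch.randn(4096, 4096, device=device)
w = torch.randn(4096, 4096, device=device)

# Kernel launches are asynchronous: the host queues them and moves on.
y = x @ w
loss = y.mean()

# loss.item() copies the scalar to the host, forcing the host to wait for
# every queued kernel to finish -- a host-device sync.
print(loss.item())

# A common mitigation is to accumulate metrics in tensors on the device and
# only sync occasionally (e.g., once per logging interval) instead of every step.
```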
Thank you so much! |
❓ The question
Hi OLMo team, thanks for the great work! As I browse through the codebase, I noticed that you have implemented your own optimizers instead of using the vanilla optimizers from PyTorch. I wonder what the difference is between the PyTorch implementation and yours, and whether you observe better performance with your implementation. Thank you!