
why customized optimizers? #566

Closed
joellliu opened this issue May 2, 2024 · 6 comments
Labels
type/question An issue that's a question

Comments

@joellliu

joellliu commented May 2, 2024

❓ The question

Hi OLMo team, thanks for the great work! As I browse through the codebase, I noticed that you have implemented your own optimizers instead of using the vanilla optimizers from pytorch. I wonder what the difference is between the pytorch implementation and your implementation, and do you observe your implementation has better performance? Thank you!

@joellliu joellliu added the type/question An issue that's a question label May 2, 2024
@dumitrac
Contributor

dumitrac commented May 3, 2024

@joellliu , OLMo's AdamW optimizer is simply a code-organization decision: it groups PyTorch's AdamW optimizer, gradient clipping, and metrics collection into a single Python module.
Note that the gradient clipping we've been using is functionally the same as FSDP.clip_grad_norm_(), but we also experimented with other forms of clipping, which had mixed results.

To summarize, our optimizer is not custom in the sense that it's still PyTorch's AdamW; it simply keeps additional functionality in the same place, which lets us experiment more easily.
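
For illustration, here's a minimal sketch of that kind of grouping, assuming you want clipping and gradient metrics to live next to the optimizer. The class name `AdamWWithGradMetrics` and its method are hypothetical, not OLMo's actual API:

```python
# Hypothetical sketch: PyTorch's AdamW plus gradient clipping and metrics in one place.
# This is NOT OLMo's actual implementation; the names here are made up for illustration.
import torch
from torch.optim import AdamW


class AdamWWithGradMetrics(AdamW):
    """Standard AdamW, with clipping and simple gradient metrics kept in one module."""

    @torch.no_grad()
    def clip_grads_and_collect_metrics(self, max_norm: float) -> dict:
        # Collect every parameter that received a gradient this step.
        params = [p for group in self.param_groups for p in group["params"] if p.grad is not None]
        # Same math as torch.nn.utils.clip_grad_norm_ / FSDP.clip_grad_norm_: clip by total norm.
        total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)
        # Left as a tensor; calling .item() here would force a host-device sync.
        return {"grad/total_norm": total_norm}


# Usage is the same as plain AdamW, with one extra call before step():
#   loss.backward()
#   metrics = optimizer.clip_grads_and_collect_metrics(max_norm=1.0)
#   optimizer.step()
#   optimizer.zero_grad(set_to_none=True)
```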

Does this answer your question?

@joellliu
Author

joellliu commented May 3, 2024

@dumitrac Thank you so much for the reply! That answers my question! I am also curious whether you have tried or implemented any techniques to increase throughput in multi-node distributed training. We tried both OLMo and TinyLlama on our cluster, and TinyLlama's per-GPU throughput drops a lot as we increase the number of nodes, while OLMo's stays roughly the same. So I am curious whether you have done any optimization for multi-node training. Thanks!

@dumitrac
Contributor

dumitrac commented May 6, 2024

@joellliu - we mainly rely on FSDP for this.
We have done some work with the profiler to avoid host-device syncs and other stalls, making sure the GPUs stay busy at all times.
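
For reference, here's a bare-bones sketch of the kind of FSDP setup being described, assuming one process per GPU launched with `torchrun`. The model class and hyperparameters are placeholders, not OLMo's actual configuration:

```python
# Minimal FSDP sketch (placeholders, not OLMo's actual training script).
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group(backend="nccl")            # one process per GPU, launched via torchrun
local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
torch.cuda.set_device(local_rank)

model = MyTransformer()                            # MyTransformer is a placeholder model class
model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # sharded params, regular AdamW
```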

@joellliu
Author

joellliu commented May 7, 2024

@dumitrac Thanks for the reply! Can you elaborate on how you avoid host-device syncs and other stalls? Thank you!

@dumitrac
Contributor

dumitrac commented May 7, 2024

@joellliu - the goal is to keep the GPU busy at all times (or as much as possible).
I'm quoting @dirkgr's description below:

Basically, the way training works on GPUs is that the “host”, i.e., the Python process, issues a bunch of instructions to the GPU (the “device”). Multiply this by that, calculate a mean, run some activation function, multiply the result by some other result, and so on, one after the other. These instructions go into a queue, and the GPU gets to work on its work queue doing all these tasks. Meanwhile, the host can keep running and do other stuff, issue more instructions, write log messages, read files, whatever. So the device can always be working, as long as the host can keep up issuing instructions.
But sometimes the host needs to make a decision based on the results of what the GPU is calculating: for example, when it wants to log the latest loss, check whether training is still on track, or decide whether it's time to write a checkpoint. At those times, the host has to wait until the device has drained its queue, do something with the result, and only then start filling the queue up again. This takes a long time, and it can often be subtle to see from the code when it happens.
You can work around this in two ways:
a) Don't do it: skip the action that causes a host-device sync, or do it less frequently (a small sketch of this follows below).
b) GPUs actually have multiple work queues, so you can try to make sure at least one of them is always full.
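
To make option (a) concrete, here's a small illustrative loop showing the difference between forcing a sync on every step and syncing occasionally. `model`, `optimizer`, `compute_loss`, and `loader` are placeholders, not OLMo's trainer:

```python
# Illustrative example of avoiding a per-step host-device sync (not OLMo code).
import torch

losses = []
for step, batch in enumerate(loader):            # `loader` is a placeholder dataloader
    loss = compute_loss(model, batch)            # placeholder forward pass returning a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # BAD: .item() makes the host wait for the GPU to drain its queue on every step.
    # print(f"step {step}: loss {loss.item():.4f}")

    # BETTER: keep the loss on-device and only sync occasionally (option "a" above).
    losses.append(loss.detach())
    if step % 100 == 0:
        mean_loss = torch.stack(losses).mean().item()   # single sync every 100 steps
        print(f"step {step}: mean loss {mean_loss:.4f}")
        losses.clear()
```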

@joellliu
Author

Thank you so much!
