Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero update #421

Merged
merged 3 commits into from
May 16, 2024
Merged

Zero update #421

merged 3 commits into from
May 16, 2024

Conversation

ngc92
Copy link
Contributor

@ngc92 ngc92 commented May 16, 2024

First, clean up the multi-gpu code a bit. We only need to #ifdef away sections that would lead to compile errors, anything else can be handled with a regular plain if, making the code more readable. Also gets rid of the two copies of the update function, let's keep the unified as long as we can.

the current ZERO-1 implementation performs an all-reduce on the gradients, but only one shard is actually needed per rank.
As per the zero paper,

State-of-art implementation of all-reduce uses a two-step approach, where the first step is a reduce-scatter operation, which reduces different part of the data on different process. The next step is an all-gather operation where each process gathers the reduced data on all the process. The result of these two steps is an all-reduce.

The claim that ZERO-1 does not introduce communication overhead hinges on the fact that we just insert the adam update between these two ops. So we go from reduce-scatter -> all-gather -> adam to reduce-scatter -> adam -> all-gather.

@karpathy
Copy link
Owner

dam, good catch, we were doing a lot more communication than we needed to. merging

@karpathy karpathy merged commit 6cfc7c5 into karpathy:master May 16, 2024
8 checks passed
@karpathy
Copy link
Owner

also nice to merge the two updates into one function 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants