
Why does ZeRO-Offload update parameters only after the model backward pass? Can the two be pipelined? #5478

Answered by GuanhuaWang
Zijie-Tian asked this question in Q&A

Hi @Zijie-Tian, what is U1234 here? I assume it is the optimizer step on the CPU side.

The main reason is that CPU compute is much slower than GPU compute. In your pipelined schedule, the parameters needed first in the next iteration (F1/P1) would be the last ones updated on the CPU, since they have to wait until U4321 has finished, so they see the longest delay. With such a pipeline, the CPU would therefore become the bottleneck of the whole training loop.
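To make the ordering problem concrete, here is a rough timing sketch. It is a toy model, not DeepSpeed code: the function name `forward_stall` and the per-layer costs are made up for illustration, and PCIe transfer time is ignored. Gradients become available in backward order (last layer first), the CPU updates layers one at a time in that same order, and the next forward pass cannot start until layer 1 (the last one updated) is done.

```python
def forward_stall(num_layers: int, gpu_backward: float, cpu_update: float) -> float:
    """Time the next forward pass waits after backward finishes (toy model).

    gpu_backward: assumed GPU time to produce one layer's gradient.
    cpu_update:   assumed CPU time to update one layer's parameters.
    """
    cpu_free = 0.0  # time at which the CPU finishes its current update
    for k in range(1, num_layers + 1):
        # The k-th gradient (layer num_layers - k + 1) is ready at k * gpu_backward.
        grad_ready = k * gpu_backward
        # The CPU starts this update once it is idle and the gradient has arrived.
        cpu_free = max(cpu_free, grad_ready) + cpu_update
    backward_done = num_layers * gpu_backward
    # Layer 1 is updated last, and the next forward pass needs it first.
    return cpu_free - backward_done


# Hypothetical costs: CPU update 5x slower than GPU backward per layer.
print(forward_stall(num_layers=4, gpu_backward=1.0, cpu_update=5.0))
# Same model with a CPU as fast as the GPU, for comparison.
print(forward_stall(num_layers=4, gpu_backward=1.0, cpu_update=1.0))
```

In this model the stall grows with the total CPU update time, so overlapping updates with backward alone hides very little once the CPU is the slower side.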

Because of this, we also implemented an optimization that delays parameter updates by one iteration, as described in Section 5 of the paper (https://arxiv.org/pdf/2101.06840):

Second, we develop a one-step delayed parameter update schedule that overlaps the CPU parameter update computation with the forward and backward computation on the G…
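As a rough illustration of what a one-step delay means numerically, here is a minimal sketch on a toy scalar problem. It is not the DeepSpeed implementation: `grad`, `train_delayed`, the loss, and the learning rate are all assumptions for illustration, and the "GPU" and "CPU" parts run sequentially here rather than truly overlapped. The point is only that the gradient computed in iteration t is applied during iteration t+1, so each forward/backward pass sees parameters that are one update behind.

```python
def grad(w: float) -> float:
    # Gradient of the toy loss 0.5 * (w - 3)^2; the minimum is at w = 3.
    return w - 3.0


def train_delayed(w: float, lr: float, steps: int) -> float:
    pending = None  # gradient from the previous iteration, not yet applied
    for _ in range(steps):
        # "GPU" side: forward + backward using the current parameters,
        # which do not yet include the previous iteration's update.
        g = grad(w)
        # "CPU" side: apply the previous iteration's gradient. In a real
        # system this would run concurrently with the line above.
        if pending is not None:
            w = w - lr * pending
        pending = g
    return w


# Usage: with a small learning rate the delayed schedule still converges.
print(train_delayed(w=0.0, lr=0.1, steps=200))
```

With the delay, the CPU update no longer sits on the critical path between backward and the next forward, at the cost of using gradients that are one step stale.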

Answer selected by Zijie-Tian