
[question] Different training between DDP & Sharded DDP #1172

Open
kwohlfahrt opened this issue Mar 29, 2024 · 0 comments

I have been comparing DDP with Fairscale Sharded DDP + OSS, and found the training progress of our model to be very different between the two setups.
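For reference, the two setups look roughly like this (a minimal sketch; the model and optimizer here are placeholders, not our actual training configuration):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

# Assumes the process group has already been initialised, e.g. under torchrun:
# torch.distributed.init_process_group(backend="nccl")
model = nn.Linear(1024, 1024).cuda()  # stand-in for our actual model

# Condition 1: plain PyTorch DDP with a regular optimizer.
ddp_model = DDP(model)
ddp_optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-4)

# Conditions 2 and 3: FairScale OSS shards the optimizer state, and
# ShardedDataParallel handles the gradient reduce/broadcast.
oss_optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)
sharded_model = ShardedDDP(model, oss_optimizer)
```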

After a bit of investigation, I suspected a race condition in the broadcasting of gradients in sharded DDP. To confirm this, I changed ShardedDataParallel._try_consume_work_handles to call _consume_work_handles instead. If I understand correctly, this only adds extra waits for all pending reduces to finish, so it should be a safe change even in the absence of races: if the async reduces are always finished by the time _try_consume_work_handles is called, the extra waits are no-ops.
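Concretely, the modification is equivalent to the following monkey-patch (shown as a patch rather than an edit to fairscale itself, and assuming the internal _try_consume_work_handles / _consume_work_handles methods keep the signatures they have in the fairscale version we are using):

```python
from fairscale.nn.data_parallel import ShardedDataParallel

def _consume_all_work_handles(self) -> None:
    # Instead of only consuming the work handles that happen to have
    # completed, block until every pending async reduce has finished.
    self._consume_work_handles()

ShardedDataParallel._try_consume_work_handles = _consume_all_work_handles
```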

This gave us three conditions to check:

  1. DDP
  2. Sharded DDP
  3. Sharded DDP (extra syncs)

We found that "DDP" and "Sharded DDP (extra syncs)" were exactly reproducible between runs, and the loss values produced were similar between the two conditions but not exactly identical. The plain "Sharded DDP" was not reproducible between runs: the first few steps were identical across repeat runs, and then they would diverge. Its loss values were also significantly different from both the baseline "DDP" and "Sharded DDP (extra syncs)".
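By "exactly reproducible" I mean the logged per-step losses were bitwise identical between repeat runs; the check was essentially the following (the log format here is illustrative, not our exact code):

```python
# Compare per-step losses from two runs of the same condition.
# Each log is assumed to be a text file with one loss value per line.
def losses_from_log(path: str) -> list[float]:
    with open(path) as f:
        return [float(line) for line in f]

run_a = losses_from_log("run_a_losses.txt")
run_b = losses_from_log("run_b_losses.txt")

for step, (a, b) in enumerate(zip(run_a, run_b)):
    if a != b:  # exact comparison of the logged values
        print(f"first divergence at step {step}: {a} vs {b}")
        break
else:
    print("runs are exactly reproducible")
```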

This raises a few questions that I'd like to get some help with:

  1. Is the modification I made to add extra syncs correct? If yes, this suggests there is a race condition in at least our usage of Sharded DDP, but I don't think we're doing anything unusual.
  2. Is it expected that "Sharded DDP" and "DDP" produce significantly different training dynamics?