Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fsdp ptd #1085

Draft
wants to merge 2 commits into
base: fixing_memory_issues_with_keeping_overlap_may24
Choose a base branch
from
Draft

Conversation

ruanslv
Copy link
Contributor

@ruanslv ruanslv commented Sep 27, 2022

  • Disable prefetching (re-enable 2 PG backward)
  • Check queue at the beginning of all-gather
  • Update queue after a free event

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 27, 2022
@min-xu-ai
Copy link
Contributor

Just curious, is this PR trying to the "limiter" thing as PTD's fsdp here?

@ruanslv
Copy link
Contributor Author

ruanslv commented Sep 29, 2022

Yes exactly. events[-1].synchronize() doesn't seem to be working as is, however.

@cyugao cyugao marked this pull request as ready for review October 7, 2022 22:05
@cyugao cyugao marked this pull request as draft October 7, 2022 22:08
@awgu
Copy link

awgu commented Oct 12, 2022

As a heads up, I made a small change to improve the rate limiter: pytorch/pytorch#86165

It is actually more performant to only dequeue and synchronize one event compared to flushing the queue. The overhead to consider is not from the CPU dequeueing and calling .synchronize() but rather the time taken to block the CPU thread during .synchronize(). Dequeueing only one reduces gaps in the communication stream, as can be seen in the PR summary. Under this new approach, a limit of 2 makes sense because it is the minimal limit to still achieve overlap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants