Adding flash attention for sequence parallel #565

Open
wants to merge 28 commits into
base: main

Contributor

@dianaml0 dianaml0 commented Dec 23, 2022

Patch Description
Creating this PR off of #511 so that it can be reviewed by @stephenroller.

The last commit (3d709db) removes some changes from the sequence parallel code that enabled testing with a world size of 1. CI is not currently running this test anyway, since CI must be updated before the test can run.

The forward and backward tests are passing right now. However, in some cases about 0.2% of the elements fail the comparison.

Testing steps
Unit Test gpu_tests/test_sequence_parallel_transformer_layer.py

@dianaml0
Contributor Author

The CircleCI failure is not related to this PR.

@stephenroller
Contributor

Can we rebase for checks? Should we be concerned about the last bits of numerical differences?

@dianaml0
Contributor Author

dianaml0 commented Jan 3, 2023

@stephenroller just rebased the PR, so it should be up to date now. The rtol and atol used are the same ones we use in xFormers for testing all flash attention backward passes. I can do a small training run to validate; would that be useful?
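For context, a tolerance check of this kind can be sketched as below. This is a minimal illustration, not the test code from this PR: the helper name `mismatch_fraction`, the sample data, and the `rtol`/`atol` values are all hypothetical, and the actual tolerances used in xFormers are not stated in this thread. The sketch assumes the usual combined rule, where an element passes when `|actual - expected| <= atol + rtol * |expected|`.

```python
def mismatch_fraction(actual, expected, rtol, atol):
    """Return the fraction of elements that fall outside the tolerance.

    An element passes when |actual - expected| <= atol + rtol * |expected|.
    """
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    if not actual:
        return 0.0
    failures = sum(
        1
        for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return failures / len(actual)


# Hypothetical data and tolerances, chosen only to illustrate the check.
expected = [1.0] * 1000
actual = [1.0] * 998 + [1.5, 0.5]  # two of 1000 elements are far off
frac = mismatch_fraction(actual, expected, rtol=1e-2, atol=1e-3)
print(f"{frac:.1%} of elements outside tolerance")
```

Reporting the failing fraction, rather than only a pass/fail result, makes it easier to judge whether a small number of outliers is within expected numerical noise.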

@dianaml0
Contributor Author

dianaml0 commented Jan 4, 2023

Looks like everything is passing now after rebasing.
