[Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] #126520
Conversation
nice catch
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased from ab8d295 to 8e82ea8.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
These changes look good
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
… _flash_attention_forward
…unction & copy ref
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] (pytorch#126520)

1. **Expose seqused_k & alibi_slopes arguments**:
- These can be used when the sequence length of k is shorter than the full extent of the tensor. This is useful for kv-cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for the external xformers lib call to the `_flash_attention_forward` API (a kv-cache sketch illustrating this follows below).

Before:
```
std::optional<Tensor> seqused_k = c10::nullopt;
std::optional<Tensor> alibi_slopes = c10::nullopt;
```

After:
```
_flash_attention_forward(...
    std::optional<Tensor>& seqused_k,
    std::optional<Tensor>& alibi_slopes,
```

2. There is a difference between **TORCH_FA2_flash_api:mha_fwd** and **FA2_flash_api:mha_fwd** (the same holds for **mha_varlen_fwd**) at the query transposition (GQA) step: the **CHECK_SHAPE** is applied to the original query rather than the reshaped query. Because of this shape constraint, inputs such as the following cause an error (a reshape sketch illustrating this follows below):
```
q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda')
k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
```
![image](https://github.com/pytorch/pytorch/assets/927999/77ea6bf6-b6e9-4f3f-96a9-8d952956ddd9)

- I've modified the code as little as possible, but if you prefer a more verbose change like the following, don't hesitate to tell me:
```
at::Tensor swapped_q = seqlenq_ngroups_swapped
    ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2)
    : q;
if (seqlenq_ngroups_swapped) {
    seqlen_q = num_heads / num_heads_k;
    num_heads = num_heads_k;
}
CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og);
```

Pull Request resolved: pytorch#126520
Approved by: https://github.com/drisspg
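To make point 1 concrete, here is a minimal sketch of what `seqused_k` represents in a kv-cache setting. This is not a call into the PyTorch or xformers API; the sizes, dtypes, and names (`k_cache`, `v_cache`, `seqused_k`, `alibi_slopes`) are assumptions for illustration only.

```python
import torch

# Hypothetical kv-cache layout: the cache is allocated to a fixed maximum
# length, but each batch element has filled only part of it. seqused_k
# records the per-batch number of valid key/value slots so the kernel can
# skip the unused tail instead of attending over the whole allocation.
batch, max_seqlen_k, num_heads_k, head_dim = 4, 1024, 1, 128  # assumed sizes
k_cache = torch.zeros(batch, max_seqlen_k, num_heads_k, head_dim, dtype=torch.bfloat16)
v_cache = torch.zeros_like(k_cache)

# One entry per batch element; int32 is an assumption here.
seqused_k = torch.tensor([17, 512, 3, 900], dtype=torch.int32)

# alibi_slopes would similarly be an optional per-head tensor of biases;
# shown only to illustrate the kind of argument being exposed.
alibi_slopes = torch.rand(num_heads_k, dtype=torch.float32)
```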
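To make point 2 concrete, here is a minimal sketch, in plain PyTorch rather than the C++ flash_api, of the seqlen_q/num_heads swap described above. The trigger condition is simplified (the real mha_fwd checks additional constraints), so treat it as an illustration of why the shape check must run on the reshaped query, not as the actual kernel logic.

```python
import torch

# Shapes from the failing example: (batch, seqlen, num_heads, head_dim)
q = torch.randn(7, 1, 4, 256, dtype=torch.bfloat16)
k = torch.randn(7, 51, 1, 256, dtype=torch.bfloat16)

batch_size, seqlen_q, num_heads, head_size_og = q.shape
num_heads_k = k.shape[2]

# Simplified trigger for the GQA "ngroups swap": a single query token with
# more query heads than key heads gets its heads folded into the sequence dim.
seqlenq_ngroups_swapped = seqlen_q == 1 and num_heads > num_heads_k
if seqlenq_ngroups_swapped:
    q = q.reshape(batch_size, num_heads_k, num_heads // num_heads_k, head_size_og).transpose(1, 2)
    seqlen_q = num_heads // num_heads_k   # 4
    num_heads = num_heads_k               # 1

# The shape constraint only holds for the swapped query: checking the
# original (7, 1, 4, 256) tensor against (7, 4, 1, 256) is what produced
# the error shown in the report above.
assert q.shape == (batch_size, seqlen_q, num_heads, head_size_og)
```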