[Misc]: a question about chunked-prefill in flash-attn backends #4863

HarryWu99 · 2024-05-16T14:50:21Z

Anything you want to discuss about vllm.

vllm/vllm/attention/backends/flash_attn.py

Line 282 in 99caa49

if prefill_meta := attn_metadata.prefill_metadata:

I noticed that in the flash-attn backend, forward_prefix and forward_decode appear to be executed serially. Does forward_decode wait for forward_prefix to finish before it runs? If so, can this still take advantage of the performance benefit of chunked prefill? I ask because the prefill tokens are in the same batch as the decode tokens.

if prefill_meta := attn_metadata.prefill_metadata:
    output[:num_prefill_tokens] = PagedAttention.forward_prefix(...)

if decode_meta := attn_metadata.decode_metadata:
    output[num_prefill_tokens:] = PagedAttention.forward_decode(...)
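
To make the layout in question concrete, here is a minimal pure-Python sketch of the pattern above. It assumes the batch is flattened with prefill tokens first and decode tokens after them, as the slicing suggests. `ToyAttnMetadata`, `toy_forward_prefix`, `toy_forward_decode`, and `toy_forward` are hypothetical stand-ins invented for illustration, not vLLM's classes or kernels, and nothing here models GPU execution.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyAttnMetadata:
    """Hypothetical stand-in for the attention metadata, not vLLM's class."""
    num_prefill_tokens: int
    num_decode_tokens: int

    @property
    def prefill_metadata(self) -> Optional["ToyAttnMetadata"]:
        # None when the batch has no prefill tokens, so the branch is skipped.
        return self if self.num_prefill_tokens > 0 else None

    @property
    def decode_metadata(self) -> Optional["ToyAttnMetadata"]:
        # None when the batch has no decode tokens, so the branch is skipped.
        return self if self.num_decode_tokens > 0 else None


def toy_forward_prefix(tokens: List[str]) -> List[str]:
    # Placeholder for the prefill attention call.
    return [f"prefill({t})" for t in tokens]


def toy_forward_decode(tokens: List[str]) -> List[str]:
    # Placeholder for the decode attention call.
    return [f"decode({t})" for t in tokens]


def toy_forward(tokens: List[str], meta: ToyAttnMetadata) -> Tuple[List[Optional[str]], List[str]]:
    num_prefill_tokens = meta.num_prefill_tokens
    output: List[Optional[str]] = [None] * len(tokens)
    call_order: List[str] = []

    # First call: fills the leading slice of the shared output buffer.
    if prefill_meta := meta.prefill_metadata:
        output[:num_prefill_tokens] = toy_forward_prefix(tokens[:num_prefill_tokens])
        call_order.append("prefill")

    # Second call: issued only after the first returns, fills the trailing slice.
    if decode_meta := meta.decode_metadata:
        output[num_prefill_tokens:] = toy_forward_decode(tokens[num_prefill_tokens:])
        call_order.append("decode")

    return output, call_order


# One mixed batch: three prefill tokens followed by two decode tokens.
tokens = ["p0", "p1", "p2", "d0", "d1"]
output, call_order = toy_forward(tokens, ToyAttnMetadata(3, 2))
```

The two calls write to disjoint slices of one output buffer, so the batch is shared even though the calls are issued one after the other.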


rkooo567 · 2024-05-18T06:48:35Z


Yeah, right now they run serially. I think after #4681 it should be possible to run them in the same attn kernel, but in our past internal benchmarks it didn't make much difference (we can definitely try again to see how much perf improvement it gives). This could be different now.

Note that this should be done after we re-revert #4820, because we should use the prefix kernel to run both in the same attn kernel, and the existing prefix kernel is too slow (flash attn varlen is at least 3x faster than this kernel).
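
As a rough illustration of what "both in the same attn kernel" could mean, the toy sketch below treats a decode step as a variable-length sequence with a single query, so one varlen-style call can cover prefill and decode sequences together. This is only a pure-Python model under stated assumptions: keys and values are stored contiguously (no paged KV cache), there is a single head, and each sequence's queries are aligned to the end of its keys. `attend` and `varlen_causal_attention` are hypothetical helpers, not flash-attn or vLLM functions.

```python
import math
from typing import List

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention for one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def varlen_causal_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    cu_seqlens_q: List[int],
    cu_seqlens_k: List[int],
) -> List[Vector]:
    """Toy variable-length causal attention over concatenated sequences.

    Sequence i owns queries[cu_seqlens_q[i]:cu_seqlens_q[i + 1]] and
    keys[cu_seqlens_k[i]:cu_seqlens_k[i + 1]]. Queries are assumed to sit at
    the END of the sequence's keys, so a decode step has exactly one query.
    """
    out: List[Vector] = []
    for i in range(len(cu_seqlens_q) - 1):
        q_lo, q_hi = cu_seqlens_q[i], cu_seqlens_q[i + 1]
        k_lo, k_hi = cu_seqlens_k[i], cu_seqlens_k[i + 1]
        q_len, k_len = q_hi - q_lo, k_hi - k_lo
        for j in range(q_len):
            # Query j sees the cached context plus new tokens up to itself.
            visible = k_len - q_len + j + 1
            out.append(attend(queries[q_lo + j],
                              keys[k_lo:k_lo + visible],
                              values[k_lo:k_lo + visible]))
    return out


def make_vectors(n: int, offset: float) -> List[Vector]:
    # Deterministic toy data with head dimension 2.
    return [[math.sin(i + offset), math.cos(i + offset)] for i in range(n)]


# Sequence A: a prefill chunk with 2 new queries over 3 keys (1 cached + 2 new).
# Sequence B: a decode step with 1 query over 4 keys.
q = make_vectors(3, 0.1)
k = make_vectors(7, 0.2)
v = make_vectors(7, 0.3)

# One call over the mixed batch.
fused = varlen_causal_attention(q, k, v, [0, 2, 3], [0, 3, 7])

# Two serial calls, mirroring separate prefill and decode paths.
separate = (
    varlen_causal_attention(q[:2], k[:3], v[:3], [0, 2], [0, 3])
    + varlen_causal_attention(q[2:], k[3:], v[3:], [0, 1], [0, 4])
)
```

In this model the fused call and the two serial calls produce the same outputs, so any gain from fusing would come from launch and scheduling overhead rather than from the math, which fits the remark that the earlier benchmark showed little difference.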

CrimsonDump · 2024-05-20T02:04:20Z


Is there an issue/PR to "re-revert #4820" for us to track?

HarryWu99 added the misc label May 16, 2024
