[Performance]: Automatic Prefix Caching in multi-turn conversations #4917
Comments
CC @robertgshaw2-neuralmagic (the tweet said the feature was added by Neural Magic, so you might have some insight into this feature)
Will take a look at this case
I'm also interested in this issue, so I benchmarked today using the latest main branch, which already uses the flash-attn kernel for prefix caching. But even though I've verified cache hits in the prefix cache, I still see no speedup when running the above script. I'll investigate a bit as well.
cc @SageMoore fyi
I am not sure what GPU this is, but on an A100 we can do ~15,000 prefill tokens/sec at fp16, so even a 2,000-token prefill should only take about 0.13 seconds to process. Since APC only skips prefill computation, there is only ~0.5 s worth of time that can be optimized away in this case. As a result, I would not really expect to see a speedup here (and in fact there is some overhead associated with managing another layer of indirection). APC really is useful for cases with long shared prefixes and short decodes, such as asking questions about a very long document.
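The arithmetic above can be sketched explicitly. A minimal back-of-envelope estimate, using only the throughput figure quoted in the comment (everything here is illustrative, not a benchmark):

```python
# Rough upper bound on how much time APC can save per request, using the
# ~15,000 prefill tokens/sec A100 fp16 figure from the comment above.

PREFILL_TOK_PER_SEC = 15_000  # from the comment; varies by GPU and model

def prefill_seconds(num_prompt_tokens: int) -> float:
    """Time spent on prefill, i.e. the most APC could possibly skip."""
    return num_prompt_tokens / PREFILL_TOK_PER_SEC

# A 2,000-token prompt costs only a fraction of a second of prefill,
# so skipping it is hard to notice next to a multi-second decode.
saving = prefill_seconds(2_000)
print(f"max APC saving per request: {saving:.2f} s")  # -> ~0.13 s
```

This is why a short-prompt, long-decode chat workload shows essentially no end-to-end difference.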
Thanks for the good hint. I changed the script to report the latency of every request instead of the total time, and here are the results on an L4 GPU: w/o APC
w. APC
This seems to align with your analysis.
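Switching from end-to-end total time to per-request latency is easy to do with a small wrapper; this is a generic sketch, not the commenter's actual script:

```python
import time

def timed(fn, *args, **kwargs):
    """Run one request and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage sketch: wrap each chat-completion call so per-turn latency
# (and any warm-up effect on the first turn) becomes visible.
latencies = []
for turn in range(3):
    _, dt = timed(lambda: sum(range(10_000)))  # stand-in for a real request
    latencies.append(dt)
```

Per-request numbers make the APC effect visible even when it is too small to move the total.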
My results came from A100 40GB with
Ok, so it's simply a case of my test not being suitable. If I were running a model with a more expensive prefill (i.e. bigger than 7B) and with longer prompts, I'd start to be able to observe the difference in a single conversation (albeit a subtle difference). Presumably there is also a concurrency benefit, because the slot that would have been scheduled to execute the cached prefill can instead be used to process the prefill (or decoding) of a different request?
The key thing for automatic prefix caching to yield a sizable improvement is that the ratio between input token length and output token length should be VERY large (ideally more than a 100x difference). This is a very strong workload requirement, and this type of workload only commonly occurs in specific applications (e.g. asking questions about a very long software manual).
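The dependence on the input/output ratio can be made concrete with a toy latency model. Both throughput numbers below are assumptions for illustration (the prefill figure echoes the earlier comment; the decode rate is made up), and only their ratio matters:

```python
# Toy model: total latency = prefill_time + decode_time. In the best case
# APC skips all of the (shared) prefill, so the achievable speedup is
# bounded by how large prefill is relative to decode.

PREFILL_TPS = 15_000  # prompt tokens/sec (from the earlier comment)
DECODE_TPS = 50       # generated tokens/sec per request (assumed)

def speedup_with_apc(input_tokens: int, output_tokens: int) -> float:
    prefill = input_tokens / PREFILL_TPS
    decode = output_tokens / DECODE_TPS
    return (prefill + decode) / decode  # best case: APC removes all prefill

# Even at a 100x input:output ratio the best-case win is modest here,
# which is why the ratio needs to be very large to matter.
print(speedup_with_apc(50_000, 500))  # 100x ratio -> ~1.33x
print(speedup_with_apc(500, 500))     # 1x ratio   -> ~1.003x (negligible)
```

Decode time dominates unless the shared prompt is enormous relative to the generated text.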
I ran a better test and have an interesting graph: regardless of the first prompt size, there seems to be a large fixed cost on turn 1 (i.e. the second turn), but not on the subsequent turns.
@hmellor - this is caused by Triton JITing. The first time the server runs context_fwd_attention, Triton JIT-compiles the kernel, which slows us down. I have been meaning to finish off a PR that runs the JITing during profiling, but it has become lower priority since, if you use latest main with flash attention, this issue is resolved: it uses the flash-attn kernels rather than Triton for context_fwd_attn.
Note: this will happen once per instantiation of the server.
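Since the JIT cost is paid once per server instance, a common way to keep it out of benchmark numbers is to discard the first measurement as warm-up. A generic sketch (the request function here is a placeholder for the real chat-completion call):

```python
import time

def benchmark(request_fn, turns, warmup: int = 1):
    """Time each turn, discarding the first `warmup` calls so one-time
    costs (e.g. kernel JIT compilation on the server) don't skew results."""
    timings = []
    for i, turn in enumerate(turns):
        start = time.perf_counter()
        request_fn(turn)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            timings.append(elapsed)
    return timings

# request_fn would wrap the actual request against the vLLM server.
```

With warmup discarded, the per-turn graph should no longer show the large fixed cost on turn 1.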
Is that the flash attention from the " |
Yes. You can just pip install vllm-flash-attn and make sure you see the log
I think it's now installed automatically: https://github.com/vllm-project/vllm/blob/main/setup.py#L356
Ok, thanks for clearing that up for me!
I'm interested in the automatic prefix caching feature for multi-turn conversations, but I can't seem to observe a performance improvement when prefix caching is enabled. This tweet from @vllm_project indicates that automatic prefix caching should benefit this use case.
I am using the following commands to start the vLLM server:
And the following script to simulate a multi-turn conversation from a user:
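A minimal sketch of such a multi-turn client (the endpoint, model name, and structure are assumptions, not the author's actual script; the network call is commented out so the snippet stands alone). Each turn re-sends the full history, so every request shares the previous one as a prefix, which is exactly the pattern APC targets:

```python
# Hypothetical multi-turn client against an OpenAI-compatible vLLM server.
# Each turn appends to `messages` and re-sends the whole history.

def append_user_turn(messages, content):
    """Extend the conversation with one user message (shared prefix grows)."""
    return messages + [{"role": "user", "content": content}]

messages = [{"role": "system", "content": "You are a helpful assistant."}]
for question in ["First question", "Follow-up 1", "Follow-up 2"]:
    messages = append_user_turn(messages, question)
    # With a live server, each turn would be something like:
    # from openai import OpenAI
    # client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    # reply = client.chat.completions.create(model=MODEL, messages=messages)
    # messages.append({"role": "assistant",
    #                  "content": reply.choices[0].message.content})
```

Timing each call individually (rather than the whole loop) makes any APC effect easier to see.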
With automatic prefix caching disabled I see:
And with automatic prefix caching enabled I see:
Is this expected?