
How to set the initial kv cache length? #1577

Closed
liminn opened this issue May 11, 2024 · 4 comments
Assignees
Labels
question Further information is requested triaged Issue has been triaged by maintainers

Comments

@liminn

liminn commented May 11, 2024

I want to test this scenario: the initial KV cache length is 2048, and the LLM then iterates 2048 times, so output_tokens = 2048; the initial KV cache length is 2048 and the final KV cache length is 4096 (2048 + 2048).

if I run:

FT_NVTX=ON /opt/nvidia/nsight-systems/2024.2.1/bin/nsys profile mpirun  -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir ./benchmarks/cpp/temp/engine_out_builddocker_tp8/ --warm_up 1 --batch_size "64" --duration 0 --num_runs 1 --input_output_len "1,2048"

the initial KV cache length is 1, not 2048.
So how do I set the initial KV cache length?

@byshiue
Collaborator

byshiue commented May 14, 2024

You should set --input_output_len "2048,2048".

@byshiue byshiue self-assigned this May 15, 2024
@byshiue byshiue added question Further information is requested triaged Issue has been triaged by maintainers labels May 15, 2024
@liminn
Author

liminn commented May 17, 2024

Sorry, I may not have expressed myself clearly.
If I set --input_output_len "2048,2048", then as I understand it the measurement includes two parts:

  • part 1: one prefill inference (input sequence length is 2048, initial KV cache length is 0)
  • part 2: 2047 decoding iterations (input sequence length is effectively 1, initial KV cache length starts at 2048), right?

However, I only want to measure the inference time of part 2, so how can I set that up?
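The two phases described above can be sketched with a little arithmetic. This is an illustration only, not TensorRT-LLM code; the function name and the per-step KV lengths are my own accounting (prompt tokens plus already-generated tokens before each step):

```python
def kv_cache_lengths(input_len: int, output_len: int):
    """Yield (phase, tokens_fed, kv_len_before_step) for each forward pass."""
    # Part 1 (prefill): the whole prompt is processed in one pass,
    # and the KV cache starts out empty.
    yield ("prefill", input_len, 0)
    # Part 2 (decode): each remaining pass feeds a single token, with the
    # KV cache already holding the prompt plus the tokens generated so far.
    for step in range(output_len - 1):
        yield ("decode", 1, input_len + step)

steps = list(kv_cache_lengths(2048, 2048))
assert steps[0] == ("prefill", 2048, 0)   # part 1: KV cache empty
assert steps[1] == ("decode", 1, 2048)    # part 2 begins with KV length 2048
assert len(steps) == 2048                 # 1 prefill pass + 2047 decode passes
```

This also shows why `--input_output_len "1,2048"` gave an initial KV cache length of 1: the prefill pass only ingests the 1-token prompt.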

@byshiue
Collaborator

byshiue commented May 23, 2024

There is no way to measure that directly. You could use nsys to trace the whole workflow and then compute the time of part 2 manually.
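For instance, once the nsys trace has given you the end-to-end time and the prefill (context-phase) time, part 2 falls out by subtraction. The numbers below are made-up placeholders, not real measurements:

```python
def decode_only_ms(total_ms: float, prefill_ms: float, decode_steps: int):
    """Split a measured run into decode-only time and mean per-step latency."""
    # Part 2 time = whole-workflow time minus the single prefill pass.
    decode_ms = total_ms - prefill_ms
    return decode_ms, decode_ms / decode_steps

# Placeholder values: 10 s end-to-end, 300 ms prefill, 2047 decode steps.
decode_ms, per_step_ms = decode_only_ms(
    total_ms=10_000.0, prefill_ms=300.0, decode_steps=2047
)
assert decode_ms == 9_700.0
```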

@liminn
Author

liminn commented May 23, 2024

ok, thanks

@liminn liminn closed this as completed May 23, 2024