
[Question] How does lightllm implement nopad batching? #405

Open
Tomorrowdawn opened this issue Apr 25, 2024 · 0 comments


Tomorrowdawn commented Apr 25, 2024

Thanks for your great work! Here are my concerns:

Say we get a batch of inputs with lengths L1, L2, .... How does lightllm simultaneously compute the attention scores for these inputs with 'nopad'? That sounds amazing, but I couldn't figure out how while reading the source code.
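To make my question concrete, here is a sketch of my current understanding of what "nopad" could mean: all sequences are concatenated along the token dimension with no padding, and a cumulative-length array marks the boundaries. The names `nopad_attention` and `cu_seqlens` are mine, not lightllm's, and this loop form is just for clarity (a real kernel would fuse it):

```python
import torch

def nopad_attention(q, k, v, cu_seqlens):
    """Causal attention over padding-free concatenated sequences.

    q, k, v: [total_tokens, num_heads, head_dim] -- every sequence in the
             batch concatenated along dim 0, no padding tokens at all.
    cu_seqlens: cumulative lengths, e.g. [0, L1, L1+L2, ...], so sequence i
                occupies rows cu_seqlens[i]:cu_seqlens[i+1].
    """
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        qi = q[s:e].transpose(0, 1)  # [heads, Li, dim]
        ki = k[s:e].transpose(0, 1)
        vi = v[s:e].transpose(0, 1)
        attn = (qi @ ki.transpose(-1, -2)) * scale  # [heads, Li, Li]
        # Causal mask *within* this sequence only; tokens of other
        # sequences are never touched, which is the whole point of nopad.
        Li = e - s
        mask = torch.triu(torch.ones(Li, Li, dtype=torch.bool), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf")).softmax(-1)
        out[s:e] = (attn @ vi).transpose(0, 1)
    return out
```

Is this the right mental model, i.e. the kernel iterates over `cu_seqlens` boundaries instead of padding to max length?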

Additionally, in the decoding phase, how do you handle differing KV lengths? (The code suggests the KV cache has a well-formed shape [B, num_heads, ...], which is confusing, because different prefixes result in different KV cache lengths.)
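My guess is that the KV cache is a flat token-level pool plus a per-request index table, so requests with different prefix lengths coexist without padding. Here is a minimal sketch of that idea; the class and attribute names (`TokenKVCache`, `slots`, etc.) are hypothetical, not lightllm's actual identifiers:

```python
import torch

class TokenKVCache:
    """Flat token-level KV pool with per-request slot indices (sketch)."""

    def __init__(self, max_tokens, num_heads, head_dim):
        # One big pool of token slots shared by all requests.
        self.k_pool = torch.zeros(max_tokens, num_heads, head_dim)
        self.v_pool = torch.zeros(max_tokens, num_heads, head_dim)
        self.free = list(range(max_tokens))  # free-slot allocator
        self.slots = {}  # request id -> list of slot indices it owns

    def append(self, req_id, k, v):
        # Store one new token's K/V for this request in any free slot.
        slot = self.free.pop()
        self.k_pool[slot] = k
        self.v_pool[slot] = v
        self.slots.setdefault(req_id, []).append(slot)

    def gather(self, req_id):
        # Reassemble this request's K/V in order; lengths differ per request.
        idx = torch.tensor(self.slots[req_id])
        return self.k_pool[idx], self.v_pool[idx]  # [len_i, heads, dim]
```

With such a layout, a decode step would gather each request's own slots, so no [B, max_len, ...] padded tensor is ever materialized. Is this close to what the code actually does?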

I want to implement batched speculative decoding, and these details are important for that.

Thanks. Any details, code, or pseudocode are appreciated.
