
[Question] How does lightllm implement nopad batching? #405

Open
Tomorrowdawn opened this issue Apr 25, 2024 · 0 comments


Tomorrowdawn commented Apr 25, 2024

Thanks for your great work! Here are my concerns:

Say we get a batch of inputs with lengths L1, L2, .... How does lightllm simultaneously compute the attention scores for these inputs with 'nopad'? That sounds amazing, but I couldn't figure out how while reading the source code.
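To make my question concrete, here is a sketch of my current understanding of what "nopad" could mean: all sequences are concatenated along the token dimension with no padding, and a cumulative-length array marks the boundaries. The names `nopad_attention` and `cu_seqlens` are mine, not lightllm's, and this loop form is just for clarity (a real kernel would fuse it):

```python
import torch

def nopad_attention(q, k, v, cu_seqlens):
    """Causal attention over padding-free concatenated sequences.

    q, k, v: [total_tokens, num_heads, head_dim] -- every sequence in the
             batch concatenated along dim 0, no padding tokens at all.
    cu_seqlens: cumulative lengths, e.g. [0, L1, L1+L2, ...], so sequence i
                occupies rows cu_seqlens[i]:cu_seqlens[i+1].
    """
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        qi = q[s:e].transpose(0, 1)  # [heads, Li, dim]
        ki = k[s:e].transpose(0, 1)
        vi = v[s:e].transpose(0, 1)
        attn = (qi @ ki.transpose(-1, -2)) * scale  # [heads, Li, Li]
        # Causal mask *within* this sequence only; tokens of other
        # sequences are never touched, which is the whole point of nopad.
        Li = e - s
        mask = torch.triu(torch.ones(Li, Li, dtype=torch.bool), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf")).softmax(-1)
        out[s:e] = (attn @ vi).transpose(0, 1)
    return out
```

Is this the right mental model, i.e. the kernel iterates over `cu_seqlens` boundaries instead of padding to max length?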

Additionally, in the decoding phase, how do you handle differing KV lengths? (The code suggests the KV cache has a well-formed shape [B, num_heads, ...], which is confusing, because different prefixes result in different KV cache lengths.)
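My guess is that the KV cache is a flat token-level pool plus a per-request index table, so requests with different prefix lengths coexist without padding. Here is a minimal sketch of that idea; the class and attribute names (`TokenKVCache`, `slots`, etc.) are hypothetical, not lightllm's actual identifiers:

```python
import torch

class TokenKVCache:
    """Flat token-level KV pool with per-request slot indices (sketch)."""

    def __init__(self, max_tokens, num_heads, head_dim):
        # One big pool of token slots shared by all requests.
        self.k_pool = torch.zeros(max_tokens, num_heads, head_dim)
        self.v_pool = torch.zeros(max_tokens, num_heads, head_dim)
        self.free = list(range(max_tokens))  # free-slot allocator
        self.slots = {}  # request id -> list of slot indices it owns

    def append(self, req_id, k, v):
        # Store one new token's K/V for this request in any free slot.
        slot = self.free.pop()
        self.k_pool[slot] = k
        self.v_pool[slot] = v
        self.slots.setdefault(req_id, []).append(slot)

    def gather(self, req_id):
        # Reassemble this request's K/V in order; lengths differ per request.
        idx = torch.tensor(self.slots[req_id])
        return self.k_pool[idx], self.v_pool[idx]  # [len_i, heads, dim]
```

With such a layout, a decode step would gather each request's own slots, so no [B, max_len, ...] padded tensor is ever materialized. Is this close to what the code actually does?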

I want to implement batched speculative decoding, and these details are important for that.

Thanks. Any details, code, or pseudocode are appreciated.
