
[Bug] Recovering logic of a long evicted request is broken #163

Open
masahi opened this issue Jan 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments

masahi (Member) commented Jan 17, 2024

https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L385-L399

For the streaming case, we cannot clamp the already-generated tokens and recompute them, since those tokens have already been streamed out to the client.
Moreover, since the clamping logic runs in the worker but not in the main process, a discrepancy arises between the main process and the worker process. See #158 and #164.
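
For concreteness, here is a minimal sketch of the clamp-and-recompute recovery path and of why running it only in the worker diverges from the main process's bookkeeping. The function and variable names are illustrative assumptions, not the actual code behind the link above.

```python
# Hypothetical sketch; names are assumptions, not the real mlc_serve engine code.
# On recovery of an evicted request, the generated tokens are clamped so that
# prompt + generated fits the per-batch token budget, and the dropped tail is
# recomputed by the model.

def clamp_generated_tokens(prompt_tokens, generated_tokens, max_num_batched_tokens):
    total = len(prompt_tokens) + len(generated_tokens)
    if total <= max_num_batched_tokens:
        return generated_tokens  # already fits, nothing to clamp
    # Drop the tail of the generated tokens so the restored request fits in one batch.
    keep = max(0, max_num_batched_tokens - len(prompt_tokens))
    return generated_tokens[:keep]

# If this clamping happens only inside the worker, the main process keeps
# tracking the original, unclamped generation length. For a streaming request
# the dropped tokens were already delivered to the client, so recomputing them
# (possibly to different values) cannot be reconciled on the main side.
```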

We need to either

@elvin-n @sunggg

masahi added the bug label on Jan 17, 2024
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this issue Jan 30, 2024
…/ Update config name (octoml#163)

This PR updates three places for a better experience.
* Unify the `--model-path` and `--model` args in build.py. Now we only
take `--model`.
* Hardcode the rotary embedding size for LLaMA to 2048. This enables us
to build a model with different max sequence length without changing the
built weights.
* Update the generated config file name to `mlc-chat-config.json`.
masahi changed the title from "[Bug] Recovering logic of a long evicted request is broken for streaming case" to "[Bug] Recovering logic of a long evicted request is broken" on Feb 1, 2024
masahi (Member, Author) commented Feb 1, 2024

@elvin-n After #157 lands, you can follow a similar strategy: use multiple EvalMultiQueryRequest instances to split the restoration of a long evicted request into several batches, each of which fits within max_num_batched_tokens.
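
A rough sketch of that chunking strategy is below; the helper is hypothetical, and the real EvalMultiQueryRequest construction in mlc_serve is not shown because its signature may differ.

```python
# Illustrative sketch only: split the tokens of an evicted request into
# consecutive chunks so that each restore step stays within the per-batch
# token budget.

def chunk_tokens_for_restore(token_ids, max_num_batched_tokens):
    """Split token_ids into consecutive chunks of at most
    max_num_batched_tokens tokens each (hypothetical helper)."""
    return [
        token_ids[start:start + max_num_batched_tokens]
        for start in range(0, len(token_ids), max_num_batched_tokens)
    ]
```

Each chunk would then back one EvalMultiQueryRequest, issued in order, so the evicted request's KV cache is rebuilt incrementally without any single batch exceeding max_num_batched_tokens.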
