
[Bug]: KV Cache and Max Tokens - Lack of Consistency #362

Open
official-elinas opened this issue Mar 28, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@official-elinas
Contributor

Your current environment

SW/Hardware:
Kubuntu 23.10
EPYC 7B13
2x 3090s
256GB DDR4 ECC

🐛 Describe the bug

Running Aphrodite as I normally do, for example:

./runtime.sh python -m aphrodite.endpoints.openai.api_server --model /media/x/nvme_1/models/Midnight-Miqu-70B-v1.5_GPTQ32G -q gptq --dtype float16 -tp 2 --port 5069 --host "192.168.1.6" --tokenizer-mode auto --max-model-len 16000 -gmu .97 --disable-log-requests --served-model-name "sambarnes/Midnight-Miqu-70B-v1.5_GPTQ32G" --enforce-eager --kv-cache-dtype fp8_e5m2

Each launch reports a different maximum number of tokens that can be stored in the KV cache; this run gave 96, the lowest I've seen:

ValueError: The model's max seq len (16000) is larger than the maximum number of tokens that can be stored in KV cache (96). Try increasing 'gpu_memory_utilization' or decreasing 'max_model_len' when initializing the engine
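For context on where that number comes from: engines in the vLLM family (which Aphrodite is based on) typically size the KV cache by profiling free GPU memory after loading weights and running a dummy forward pass, then dividing the remaining budget by the per-token KV footprint. The sketch below is illustrative only; the function and parameter names are assumptions, not Aphrodite's actual API:

```python
# Illustrative sketch (NOT Aphrodite's real code) of how a vLLM-style
# engine derives "max tokens that can be stored in KV cache".


def max_kv_cache_tokens(
    free_gpu_bytes: int,          # free memory measured after weight load + profiling
    gpu_memory_utilization: float,  # the -gmu flag, e.g. 0.97
    num_layers: int,
    num_kv_heads: int,            # KV heads per GPU (after tensor parallelism)
    head_dim: int,
    kv_dtype_bytes: int,          # 1 for fp8_e5m2, 2 for float16
    block_size: int,              # tokens per paged-attention block
    total_gpu_bytes: int,
) -> int:
    # Budget = the utilization cap minus memory already in use.
    used = total_gpu_bytes - free_gpu_bytes
    budget = int(total_gpu_bytes * gpu_memory_utilization) - used
    # Per-token KV footprint: 2 tensors (K and V) per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    # Capacity is allocated in whole blocks.
    num_blocks = max(budget, 0) // (bytes_per_token * block_size)
    return num_blocks * block_size
```

Because the measured free memory depends on whatever else is resident on the GPU at launch time (and on allocator fragmentation), the same command can yield different capacities run to run, which would explain the inconsistency described above: anything eating a few hundred MB at profiling time shrinks the budget, and with a 0.97 utilization cap there is very little headroom before the capacity collapses toward numbers like 96.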

I brought this up previously in the #aphrodite channel of the KoboldAI Discord. I've seen the reported capacity as high as 48K, and I've been able to run at 32K many times.

I am on the latest commit, and this has been happening for a while. It feels similar to an issue from months ago where ray would intermittently fail across repeated launches.

@official-elinas official-elinas added the bug Something isn't working label Mar 28, 2024
@official-elinas official-elinas changed the title [Bug]: KV Cache and Max Token - Lack of Consistency [Bug]: KV Cache and Max Tokens - Lack of Consistency Mar 28, 2024