
The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM. #1856

Open
rxsalad opened this issue May 4, 2024 · 0 comments


rxsalad commented May 4, 2024

System Info

Test with llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM

Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 21.1 GB (max batch total tokens set to 38928)
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 22.5 GB
VRAM usage after batched inference, 24 x (prompt 512, decode 512): 23.8 GB
So it can support 24 x (prompt 512, decode 512).

Test with llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 on the same RTX 3090 with 24 GB VRAM

Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 19.2 GB (max batch total tokens set to 104384)
VRAM usage after batched inference, 8 x (prompt 512, decode 512): 23.2 GB
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 24.5 GB
So it can only support 8 x (prompt 512, decode 512).

The questions are:

1) Why does TGI reserve a significant amount of VRAM, and what is it used for?
2) Why does VRAM usage keep growing during inference after TGI has already reserved that large amount?

It is hard to believe that the quantized model, which is almost 10 GB smaller than the standard Llama 3 8B, can handle far fewer batched requests on the same hardware.
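
For reference, here is a back-of-the-envelope KV-cache estimate. It only assumes the published Llama-3-8B attention shape (32 layers, 8 KV heads under GQA, head dimension 128) and fp16 KV entries; whether TGI actually pre-allocates the cache for the advertised max batch total tokens at load time is my assumption, not something I have confirmed in the code.

# Rough KV-cache sizing for Llama-3-8B, assuming fp16 KV entries (AWQ quantizes weights, not the KV cache)
# per token: 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes
BYTES_PER_TOKEN=$((2 * 32 * 8 * 128 * 2))                                                     # 131072 bytes = 128 KiB
echo "reserved for 38928 tokens:  $((38928 * BYTES_PER_TOKEN / 1024 / 1024)) MiB"             # 4866 MiB (~4.75 GiB)
echo "reserved for 104384 tokens: $((104384 * BYTES_PER_TOKEN / 1024 / 1024)) MiB"            # 13048 MiB (~12.7 GiB)
echo "24 x 1024-token sequences:  $((24 * 1024 * BYTES_PER_TOKEN / 1024 / 1024 / 1024)) GiB"  # 3 GiB

If that assumption holds, the numbers above roughly line up: ~16 GB of fp16 weights plus ~4.75 GiB of reserved cache is close to the 21.1 GB observed after load, ~3 GiB of cache matches the 21.1 GB to 23.8 GB growth during the 24-sequence run, and for the AWQ model the much larger 104384-token reservation (~12.7 GiB) on top of ~6 GB of weights is close to the 19.2 GB after load. This is only an estimate from the model shape, not a confirmed explanation of TGI's allocator.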

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=casperhansen/llama-3-8b-instruct-awq
text-generation-launcher --quantize awq --model-id $model

model=meta-llama/Meta-Llama-3-8B-Instruct
text-generation-launcher --model-id $model

text-generation-benchmark --tokenizer-name $model --batch-size 1 --batch-size 2 --batch-size 4 --batch-size 8 --batch-size 16 --batch-size 24 --sequence-length 512 --decode-length 512
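
Not part of the original steps, but if you are reproducing this, the VRAM figures above can be sampled while the benchmark runs with a plain nvidia-smi loop (assumes the standard NVIDIA driver tools are installed):

# print used/total GPU memory once per second while text-generation-benchmark is running
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1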

Expected behavior

The quantized model should be able to support larger batches than the standard Llama 3 8B, given its smaller size (6 GB vs. 16 GB).

@rxsalad rxsalad changed the title The quantized llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on the same RTX 3090 with 24GB VRAM. The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM. May 4, 2024