
The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM. #1856

Open
rxsalad opened this issue May 4, 2024 · 0 comments


rxsalad commented May 4, 2024

System Info

Test with llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM

Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 21.1 GB (max batch total tokens set to 38928)
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 22.5 GB
VRAM usage after batched inference, 24 x (prompt 512, decode 512): 23.8 GB
So it can support 24 x (prompt 512, decode 512).

Test with llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 on the same RTX 3090 with 24 GB VRAM

Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 19.2 GB (max batch total tokens set to 104384)
VRAM usage after batched inference, 8 x (prompt 512, decode 512): 23.2 GB
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 24.5 GB
So it can only support 8 x (prompt 512, decode 512).

The questions are:

1) Why does TGI reserve a significant amount of VRAM, and what is it used for?
2) Why does VRAM usage keep growing during inference after TGI has already reserved that large amount?

It is hard to believe that the quantized model, which is almost 10 GB smaller than the standard Llama 3 8B, can handle far fewer batched requests on the same hardware.
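
For reference, here is a back-of-the-envelope KV-cache estimate. It only assumes the published Llama-3-8B attention shape (32 layers, 8 KV heads under GQA, head dimension 128) and fp16 KV entries; whether TGI actually pre-allocates the cache for the advertised max batch total tokens at load time is my assumption, not something I have confirmed in the code.

# Rough KV-cache sizing for Llama-3-8B, assuming fp16 KV entries (AWQ quantizes weights, not the KV cache)
# per token: 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes
BYTES_PER_TOKEN=$((2 * 32 * 8 * 128 * 2))                                                     # 131072 bytes = 128 KiB
echo "reserved for 38928 tokens:  $((38928 * BYTES_PER_TOKEN / 1024 / 1024)) MiB"             # 4866 MiB (~4.75 GiB)
echo "reserved for 104384 tokens: $((104384 * BYTES_PER_TOKEN / 1024 / 1024)) MiB"            # 13048 MiB (~12.7 GiB)
echo "24 x 1024-token sequences:  $((24 * 1024 * BYTES_PER_TOKEN / 1024 / 1024 / 1024)) GiB"  # 3 GiB

If that assumption holds, the numbers above roughly line up: ~16 GB of fp16 weights plus ~4.75 GiB of reserved cache is close to the 21.1 GB observed after load, ~3 GiB of cache matches the 21.1 GB to 23.8 GB growth during the 24-sequence run, and for the AWQ model the much larger 104384-token reservation (~12.7 GiB) on top of ~6 GB of weights is close to the 19.2 GB after load. This is only an estimate from the model shape, not a confirmed explanation of TGI's allocator.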

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=casperhansen/llama-3-8b-instruct-awq
text-generation-launcher --quantize awq --model-id $model

model=meta-llama/Meta-Llama-3-8B-Instruct
text-generation-launcher --model-id $model

text-generation-benchmark --tokenizer-name $model --batch-size 1 --batch-size 2 --batch-size 4 --batch-size 8 --batch-size 16 --batch-size 24 --sequence-length 512 --decode-length 512
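
Not part of the original steps, but if you are reproducing this, the VRAM figures above can be sampled while the benchmark runs with a plain nvidia-smi loop (assumes the standard NVIDIA driver tools are installed):

# print used/total GPU memory once per second while text-generation-benchmark is running
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1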

Expected behavior

The quantized model should be able to support larger batches than the standard Llama 3 8B, given its smaller size (6 GB vs. 16 GB).

@rxsalad rxsalad changed the title The quantized llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on the same RTX 3090 with 24GB VRAM. The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM. May 4, 2024