Mistral7b takes 4 times its size in VRAM on A100 #1863
Comments
Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!
Update: Apologies, I misread the question and assumed the KV cache was already taken into account. See: #1863 (comment)
Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU. The GPU was reserved for serving TGI. I think I found what "the problem" is. Basically, when TGI warms up the model, it allocates memory for the KV cache, so those 67 GB of VRAM come from the ~14 GB of Mistral7B weights plus the memory allocated for the KV cache. This is the piece of code I refer to, from
However, I am missing a couple of things:
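For intuition, here is a rough back-of-the-envelope estimate of the KV-cache footprint. The architecture constants below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) come from the public Mistral-7B config; this is a sketch of the arithmetic, not TGI's actual allocation code.

```python
# Rough KV-cache sizing for Mistral-7B in fp16 (a sketch, not TGI's code).
# Architecture constants from the public Mistral-7B config.
num_layers = 32    # transformer layers
num_kv_heads = 8   # grouped-query attention: 8 KV heads (not 32 query heads)
head_dim = 128     # dimension per attention head
bytes_fp16 = 2     # 16-bit precision

# Each token stores one K and one V vector per layer per KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

# The ~53,815 MiB not accounted for by the weights:
leftover_mib = 53_815
tokens_that_fit = leftover_mib * 1024**2 // kv_bytes_per_token
print(f"Tokens that fit in leftover VRAM: ~{tokens_that_fit:,}")   # ~430,520
```

On these assumptions, the warmup deliberately fills most of the free VRAM with KV-cache blocks so that large batches can be served; the 67 GB figure is expected behavior rather than a leak.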
Hey @martinigoyanes,
Yes, as discussed in other threads, the KV cache is essential for inference; without it, there is no inference. So there is no problem here.
Would you consider closing this issue, as it's no longer an issue?
Regarding your questions: I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads) and depends on the underlying hardware architecture and the model architecture.
Thank you so much for your reply! I will close the issue and rely on discussion #1897.
Environment Setup
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: c38a7d7
Docker label: sha-6c4496a
Kubernetes cluster deployment
1 A100 GPU with 80 GB VRAM
12 CPUs with 32 GB RAM
TGI version: 2.0.0
TGI Parameters:
MAX_INPUT_LENGTH: "8000"
MAX_TOTAL_TOKENS: "8512"
MAX_CONCURRENT_REQUESTS: "128"
LOG_LEVEL: "INFO"
MAX_BATCH_TOTAL_TOKENS: "4294967295"
WAITING_SERVED_RATIO: "0.3"
MAX_WAITING_TOKENS: "0"
MAX_BATCH_PREFILL_TOKENS: "32768"
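As an aside, a quick sanity check of how these limits relate to each other, using plain arithmetic over the values above. This is not TGI's own validation logic, and the reading of `MAX_BATCH_TOTAL_TOKENS` as "effectively uncapped" is an assumption:

```python
# Sanity-check arithmetic over the TGI parameters above (not TGI's own logic).
MAX_INPUT_LENGTH = 8000
MAX_TOTAL_TOKENS = 8512
MAX_BATCH_PREFILL_TOKENS = 32768
MAX_BATCH_TOTAL_TOKENS = 4294967295  # == 2**32 - 1, i.e. u32::MAX

# Each request may generate at most this many new tokens:
print(MAX_TOTAL_TOKENS - MAX_INPUT_LENGTH)           # 512

# At most this many max-length prompts fit in one prefill batch:
print(MAX_BATCH_PREFILL_TOKENS // MAX_INPUT_LENGTH)  # 4

# MAX_BATCH_TOTAL_TOKENS is set to u32::MAX, so presumably the KV-cache
# capacity measured at warmup, not this setting, bounds the batch size.
assert MAX_BATCH_TOTAL_TOKENS == 2**32 - 1
```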
Question
I am trying to run Mistral7b on an A100 80GB, but I see it taking up 67 GB of VRAM.
Why is that? Mistral7b is a 7B-parameter model, so with 16-bit precision the weights should take about 14 GB of VRAM, right?
Where are the extra 53815 MB of VRAM coming from?
What seems to be the issue? Am I doing something wrong?
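For reference, the 14 GB figure in the question checks out. A minimal sketch of the arithmetic, assuming ~7.24B as Mistral-7B's parameter count:

```python
# Weights-only VRAM estimate for Mistral-7B in fp16 (a sketch).
params = 7.24e9   # approximate Mistral-7B parameter count
bytes_fp16 = 2    # 16-bit precision

weights_gib = params * bytes_fp16 / 1024**3
print(f"Weights: ~{weights_gib:.1f} GiB")  # ~13.5 GiB, i.e. roughly 14 GB

observed_gib = 67
print(f"Gap: ~{observed_gib - weights_gib:.1f} GiB")  # ~53 GiB
```

That ~53 GiB gap is what the KV-cache warmup allocation discussed in the comments above accounts for.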