Anything you want to discuss about vllm.
Hi guys, I have a question regarding GPU memory usage. I have 256 GB of VRAM in total, spread across Tesla V100 32GB GPUs. I deployed the model TheBloke/Llama-2-70B-Chat-GPTQ with dtype = float16. When I monitor GPU usage, I can see that each GPU holds around 17 GB. My question is: I expected that loading the model would take around 40 GB in total (the model size) out of the 256 GB, but the actual GPU usage is much higher than that. Why does this happen?
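For context, here is a minimal sketch of how such a deployment might be launched with vLLM's Python API. The exact invocation wasn't shown in the question; `tensor_parallel_size=8` is an assumption based on 256 GB / 32 GB per V100, and the `gpu_memory_utilization` value shown is just vLLM's default:

```python
# Hypothetical reconstruction of the deployment described above;
# the actual launch command was not included in the question.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    dtype="float16",
    # Assumption: 8 x V100 32GB = 256 GB total VRAM.
    tensor_parallel_size=8,
    # vLLM pre-reserves this fraction of each GPU's memory up front
    # (default 0.9) for the weights plus paged KV-cache blocks, so
    # observed per-GPU usage is expected to exceed what the quantized
    # checkpoint alone would occupy.
    gpu_memory_utilization=0.9,
)
```

If this matches your setup, the `gpu_memory_utilization` pre-allocation may be the relevant knob to look at, since it makes reported usage a function of total GPU memory rather than of the model's size on disk.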