Description
I have multiple GPUs and a single Triton server pod running inside a Kubernetes cluster, serving multiple models including BLS and TensorRT engine models.
When my models run on a node with a single GPU there is no issue at all, but adding an additional GPU results in slowly increasing memory usage.
I also observed rapidly increasing memory usage while using two GPUs, but only on the first one (see the chart below).
Triton Information
Tested with official images:
nvcr.io/nvidia/tritonserver:23.12-py3
nvcr.io/nvidia/tritonserver:24.04-py3
To Reproduce
My sample BLS model looks like the one below:
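A minimal sketch of the pattern, assuming placeholder tensor names (`INPUT`, `OUTPUT`) and a downstream model called `image`; the preprocessing step is where `.cuda()` / `device="cuda"` gets used:

```python
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder input name -- the real model uses different names.
            image = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()

            # Preprocessing on the default CUDA device: a bare .cuda() /
            # device="cuda" always resolves to GPU 0.
            tensor = torch.from_numpy(image).cuda()
            tensor = (tensor / 255.0).to(device="cuda")

            # BLS call into the downstream model ("image" is a placeholder name).
            infer_request = pb_utils.InferenceRequest(
                model_name="image",
                requested_output_names=["OUTPUT"],
                inputs=[pb_utils.Tensor("INPUT", tensor.cpu().numpy())],
            )
            infer_response = infer_request.exec()

            output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```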
Wondering if those `cuda()` or `device="cuda"` calls used inside my `preprocess`/`image` service can raise issues while running on multiple GPUs.
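For reference, a minimal sketch of what device-aware placement would look like, assuming the instance's GPU can be read from the `args` passed to `initialize` (this is my guess at the relevant difference, not a confirmed fix; `INPUT`/`OUTPUT` are placeholders):

```python
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Triton reports the instance's placement here; using it avoids the
        # implicit default "cuda", which is always cuda:0.
        if args["model_instance_kind"] == "GPU":
            self.device = torch.device(f"cuda:{args['model_instance_device_id']}")
        else:
            self.device = torch.device("cpu")

    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            # Preprocess on the device this instance was assigned to.
            tensor = torch.from_numpy(image).to(self.device) / 255.0
            output = pb_utils.Tensor("OUTPUT", tensor.cpu().numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```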
Expected behavior
No memory leak and proper request load balancing.