
Memory leak with multiple GPUs and BLS #7190

Open
kbegiedza opened this issue May 7, 2024 · 1 comment

Comments

@kbegiedza

Description

I have multiple GPUs and a single Triton server pod running inside a Kubernetes cluster, serving multiple models, including BLS models and TensorRT engines.

When my models run on a node with a single GPU there is no issue at all, but adding an additional GPU results in slowly increasing memory usage.

I also observed rapidly increasing memory while using two GPUs, but only on the first one (see the chart below).

[Chart: per-GPU memory usage over time; memory grows only on the first GPU]
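
A minimal way to watch which device is actually growing is a poller like the one below (a rough sketch; it assumes pynvml is installed in the container, and the chart above comes from our cluster monitoring, not from this script):

# Poll per-GPU memory with NVML to confirm which device keeps growing.
import time

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

while True:
    for i, handle in enumerate(handles):
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: used={info.used / 1024**2:.0f} MiB")
    time.sleep(60)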

Triton Information

Tested with official images:

  • nvcr.io/nvidia/tritonserver:23.12-py3
  • nvcr.io/nvidia/tritonserver:24.04-py3

To Reproduce

My sample BLS model looks like the one below:

import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack

# BBoxService, ImageService, __preprocess, __postprocess and
# _PREPROCESS_LETTERBOX_SIZE are defined elsewhere and omitted here.


class TritonPythonModel:
    def __init__(self) -> None:
        self.__bbox_service = BBoxService()
        self.__image_service = ImageService()

    async def execute(self, requests):
        responses = []

        for request in requests:
            image_triton = pb_utils.get_input_tensor_by_name(request, "IMAGES")
            image_tensor = from_dlpack(image_triton.to_dlpack()).cuda().to(torch.float32)  # type: ignore

            image_tensor = self.__image_service.reverse_last_channel(image_tensor)

            preprocessed_image_tensor = self.__preprocess(image_tensor, _PREPROCESS_LETTERBOX_SIZE).to(torch.float16)

            inference_request_input = pb_utils.Tensor.from_dlpack("images", to_dlpack(preprocessed_image_tensor))  # type: ignore
            inference_request = pb_utils.InferenceRequest(  # type: ignore
                model_name="__model:0",
                requested_output_names=["output0"],
                inputs=[inference_request_input],
            )

            inference_response = await inference_request.async_exec()

            prediction_triton = pb_utils.get_output_tensor_by_name(inference_response, name="output0")
            prediction_tensor = from_dlpack(prediction_triton.to_dlpack())  # type: ignore

            bboxes_tensor = self.__postprocess(prediction_tensor, image_tensor.shape, _PREPROCESS_LETTERBOX_SIZE)
            bboxes_tensor = bboxes_tensor.contiguous()

            bboxes_triton = pb_utils.Tensor.from_dlpack("BBOXES", to_dlpack(bboxes_tensor.to(torch.float16)))  # type: ignore
            inference_response = pb_utils.InferenceResponse(output_tensors=[bboxes_triton])  # type: ignore
            responses.append(inference_response)

        return responses

I'm wondering whether the cuda() calls or device="cuda" arguments used inside my preprocess / image service could cause issues when running on multiple GPUs (see the rough sketch below).
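
For example, a minimal sketch of what I mean (assuming the input tensor already lives on the GPU assigned to this model instance; names are illustrative):

# Sketch: derive the target device from the incoming tensor instead of
# calling .cuda(), which always targets the current default device
# (typically cuda:0), regardless of which GPU the instance runs on.
import torch

def to_instance_device(image_tensor: torch.Tensor) -> torch.Tensor:
    device = image_tensor.device if image_tensor.is_cuda else torch.device("cuda")
    return image_tensor.to(device, dtype=torch.float32)

# Tensors created later in preprocess/postprocess would then reuse that device,
# e.g. torch.zeros((3, *size), device=image_tensor.device), instead of "cuda".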

Expected behavior
No memory leak and proper request load balancing across both GPUs.

@Tabrizian
Member

@kbegiedza Thanks for reporting this issue. Can you share the code for BBoxService and ImageService as well so that we can repro this issue?
