
client silent failure - E0422 05:03:24.145960 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument #7148


jrcavani commented Apr 23, 2024

Description

The model repo is an object detection ensemble, which consists of a preprocessor written with the Python backend and the main model as a TensorRT plan. The Python backend uses CuPy to allocate GPU tensors and passes them back to the Triton scheduler with pb_utils.Tensor.from_dlpack for the TensorRT model.

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor.from_dlpack(
                        self.output_name, preprocessed_full_batch.copy()
                    )
                ]
            )

The CuPy allocation during preprocessing looks like:

    def preprocess(self, batch):
        """
        batch is imgs in HWC uint8 BGR format.
        """
        import cupy as xp
        import numpy as np

        batch = xp.asarray(np.array(batch))  # stack into a uint8 array and move to the GPU
        input_blob = batch.astype(xp.float32)  # convert dtype after moving to the GPU

        input_blob = input_blob[..., ::-1]  # BGR to RGB
        input_blob = input_blob.transpose(0, 3, 1, 2)  # NHWC to NCHW
        input_blob -= self.input_mean
        input_blob /= self.input_std

        return input_blob
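
For completeness, a minimal sketch of how the two snippets above fit together in the model's execute() (simplified: the tensor names follow the ensemble config below, the normalization constants are placeholders, and reshaping of the flattened uint8 input is omitted):

    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def initialize(self, args):
            self.output_name = "input.1"  # tensor name consumed by the TensorRT model (see output_map below)
            self.input_mean = 127.5       # placeholder normalization constants
            self.input_std = 128.0

        def execute(self, requests):
            responses = []
            for request in requests:
                # Incoming images, HWC uint8 BGR; reshaping of the flattened input is omitted here.
                images = pb_utils.get_input_tensor_by_name(request, "image").as_numpy()
                preprocessed_full_batch = self.preprocess(images)  # CuPy array on the GPU

                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[
                            pb_utils.Tensor.from_dlpack(
                                self.output_name, preprocessed_full_batch.copy()
                            )
                        ]
                    )
                )
            return responses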

It works great, unless the user submits a large input / big batch size that exceeds some CUDA buffer limit. The error on the server side looks like:

W0422 05:00:35.414218 1 memory.cc:212] Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
E0422 05:00:35.457498 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument
E0422 05:00:35.537644 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument
E0422 05:00:35.560513 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument

However, the client does not get an error response or exception! I tried both the HTTP and gRPC Python clients, and they behave the same: no error is raised, but the output tensors are incorrect. This silent failure is very alarming, because the results are garbage but appear to be valid.
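
For reference, the client call looks roughly like this (a sketch with the HTTP client; the ensemble name "detector_ensemble" and the input shape are illustrative, only the tensor names come from the config below):

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Flattened uint8 image bytes, matching the ensemble input (dims: [ -1 ]); shape is illustrative.
    batch = np.random.randint(0, 256, size=(16, 3 * 640 * 640), dtype=np.uint8)

    inp = httpclient.InferInput("image", list(batch.shape), "UINT8")
    inp.set_data_from_numpy(batch)

    # This call returns normally even while the server logs the GPU-buffer error;
    # the only symptom is that the returned tensors contain garbage.
    result = client.infer("detector_ensemble", inputs=[inp])  # model name is illustrative
    scores = result.as_numpy("score_8")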

Triton Information
container 24.03 - tritonserver 2.44.0

Are you using the Triton container or did you build it yourself?
NGC container

To Reproduce
The description above should be enough to reproduce. By matching the message text, the error appears to come from this line:

https://github.com/triton-inference-server/python_backend/blob/r24.03/src/pb_stub.cc#L403-L404

It must happen at or after pb_utils.Tensor.from_dlpack(). Why is an error raised inside the Python backend not forwarded to the client?

If I convert the CuPy array back to NumPy and load it the usual way, it works:

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor(
                        self.output_name, cp.asnumpy(preprocessed_full_batch[start_offset:end_offset])
                    )
                ]
            )

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

ensemble model config:

platform: "ensemble"
max_batch_size: 16

input [
  {
    name: "image"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]

output [
  {
    name: "score_8"
    data_type: TYPE_FP32
    dims: [ 12800, 1 ]
  },
  {
    name: "bbox_8"
    data_type: TYPE_FP32
    dims: [ 12800, 4 ]
  },
  {
    name: "kps_8"
    data_type: TYPE_FP32
    dims: [ 12800, 10 ]
  },
  {
    name: "score_16"
    data_type: TYPE_FP32
    dims: [ 3200, 1 ]
  },
  {
    name: "bbox_16"
    data_type: TYPE_FP32
    dims: [ 3200, 4 ]
  },
  {
    name: "kps_16"
    data_type: TYPE_FP32
    dims: [ 3200, 10 ]
  },
  {
    name: "score_32"
    data_type: TYPE_FP32
    dims: [ 800, 1 ]
  },
  {
    name: "bbox_32"
    data_type: TYPE_FP32
    dims: [ 800, 4 ]
  },
  {
    name: "kps_32"
    data_type: TYPE_FP32
    dims: [ 800, 10 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "detector_preprocessor"
      model_version: -1
      input_map {
        key: "image"
        value: "image"
      }
      output_map {
        key: "input.1"
        value: "input.1"
      }
    },
    {
      model_name: "detector_main_model"
      model_version: -1
      input_map {
        key: "input.1"
        value: "input.1"
      }
      output_map {
        key: "score_8"
        value: "score_8"
      }
      output_map {
        key: "bbox_8"
        value: "bbox_8"
      }
      output_map {
        key: "kps_8"
        value: "kps_8"
      }
      output_map {
        key: "score_16"
        value: "score_16"
      }
      output_map {
        key: "bbox_16"
        value: "bbox_16"
      }
      output_map {
        key: "kps_16"
        value: "kps_16"
      }
      output_map {
        key: "score_32"
        value: "score_32"
      }
      output_map {
        key: "bbox_32"
        value: "bbox_32"
      }
      output_map {
        key: "kps_32"
        value: "kps_32"
      }
    }
  ]
}

Expected behavior

A server-side error that invalidates the output should surface as an explicit error on the client side.

In addition, I would love to get some clarity on how cuda-memory-pool-byte-size is used when GPU tensors are queued from one model to another. What's the max queue size, and do all queued tensors take up space in this globally shared cuda-memory-pool-byte-size?

jbkyang-nvi (Contributor) commented

Hello, while we try to reproduce your issue, can you update your client + server to Triton 24.03? 23.04 is a year old and we don't really maintain containers that old.

jbkyang-nvi (Contributor) commented

cuda-memory-pool-byte-size is per GPU. As per the tritonserver CLI help:

The total byte size that can be allocated as CUDA memory for the GPU device. If GPU support is enabled, the server will allocate CUDA memory to minimize data transfer between host and devices until it exceeds the specified byte size. This option will not affect the allocation conducted by the backend frameworks.

The "queued tensors" will take up space for all models


jrcavani commented May 2, 2024

@jbkyang-nvi I am indeed using 24.03, the latest container.

On cuda-memory-pool-byte-size, it is exactly this last sentence that confused me:

This option will not affect the allocation conducted by the backend frameworks.

It sounds like cuda-memory-pool-byte-size only affects client -> server CUDA shared memory when the client and server are on the same host, and that backend (in this case Python) allocations are not affected by this option.

But when I increased this value to 2GB, no errors were reported anymore. So maybe the correct understanding is that it sets the size of the pool used for tensors passed between backends, and it does not affect how the backend code itself allocates GPU memory, such as CuPy allocations in the Python backend. Is that right?

It's easy to get confused because the CUDA pool and the pinned memory pool are used both between client and server and between backends in an ensemble.

The issue still exists IMO, as this only sidesteps the problem: I'm sure that if the server errors again, the client will still get the silent treatment.
