Casting a NumPy string array to pb_utils.Tensor disproportionately increases latency #7153

Open
LLautenbacher opened this issue Apr 24, 2024 · 2 comments
Labels
module: backends (Issues related to the backends), question (Further information is requested)

Comments

@LLautenbacher

LLautenbacher commented Apr 24, 2024

Description
Casting a NumPy string array to pb_utils.Tensor in the Python backend causes a disproportionate (~300x) increase in latency.

Triton Information
nvcr.io/nvidia/tritonserver:23.05-py3
This still occurs in 24.03 as well.

To Reproduce
When using the model and config below, I get a latency of 9873 usec with perf_analyzer. Uncommenting the line pb_utils.Tensor("annotation", arr_s) increases the latency to 2888440 usec. Creating the NumPy array doesn't seem to matter; only casting it to a tensor causes the slowdown.

model.py

import triton_python_backend_utils as pb_utils
import numpy as np
import json


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])
        output0_config = pb_utils.get_output_config_by_name(
            self.model_config, "annotation"
        )
        self.output_dtype = pb_utils.triton_string_to_numpy(output0_config["data_type"])

    def execute(self, requests):
        responses = []
        for request in requests:
            batchsize = (
                pb_utils.get_input_tensor_by_name(request, "input0").as_numpy().shape[0]
            )
            arr_s = np.empty((batchsize, 256), dtype=np.dtype("S5"))
            arr_f = np.empty((batchsize, 256), dtype=np.dtype("float64"))
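            # Uncommenting the next line (building a tensor from the string array) triggers the ~300x slowdown: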
            # pb_utils.Tensor("annotation", arr_s)
            t = pb_utils.Tensor("annotation", arr_f)
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        pass

config.pbtxt

max_batch_size: 1000
input [
  {
    name: "input0"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "annotation"
    data_type: TYPE_FP64
    dims: [ 256 ]
  }
]
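
To confirm that the slowdown comes from the pb_utils.Tensor construction rather than from allocating the NumPy array, a rough timing sketch along these lines can be dropped into execute (time.perf_counter and the print are additions for illustration; everything else uses the names from model.py above):

import time

# inside execute(), within the per-request loop (batchsize, np, pb_utils as in model.py):
start = time.perf_counter()
arr_s = np.empty((batchsize, 256), dtype=np.dtype("S5"))
created = time.perf_counter()
t_s = pb_utils.Tensor("annotation", arr_s)  # the construction under suspicion
cast = time.perf_counter()
print(f"np.empty: {(created - start) * 1e6:.0f} usec, "
      f"pb_utils.Tensor: {(cast - created) * 1e6:.0f} usec", flush=True)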

Expected behavior
Returning a string array shouldn't take ~300x as long as returning a float array.

@rmccorm4
Collaborator

rmccorm4 commented May 1, 2024

Hi @LLautenbacher, thanks for raising this issue with such detail.

@Tabrizian @krishung5 may be able to chime in here.

Is it possible this commented line is causing an extra copy? Also, can you elaborate on this datatype, np.dtype("S5")? Is it required, and do you see different behavior if you use something like np.object_ instead?
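
For reference, the variant I'm thinking of would look roughly like this (shape and output name copied from your model.py; the placeholder contents are only illustrative):

arr_o = np.empty((batchsize, 256), dtype=np.object_)
arr_o.fill(b"abcde")  # placeholder bytes so the object array has concrete elements
t = pb_utils.Tensor("annotation", arr_o)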

rmccorm4 added the question and module: backends labels on May 1, 2024
@LLautenbacher
Author

Thank you for looking into this!

The specific string datatype is not relevant; U, S, and O dtypes all show this behaviour.
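
Concretely, the three dtype variants are roughly these (the widths are only illustrative):

arr_u = np.empty((batchsize, 256), dtype=np.dtype("U5"))  # unicode strings
arr_s = np.empty((batchsize, 256), dtype=np.dtype("S5"))  # byte strings
arr_o = np.empty((batchsize, 256), dtype=np.object_)      # Python objects
# passing any of these to pb_utils.Tensor("annotation", ...) shows the same slowdown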
