
Multi-instancing a model on a GPU does not increase throughput in Triton #7108

Open
ign4si opened this issue Apr 12, 2024 · 1 comment

ign4si commented Apr 12, 2024

Description
Multi-instantiating a model on a GPU does not increase throughput when requests are sent from two different client threads.
Triton Information
+----------------------------------+--------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.42.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_con |
| | figuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logg |
| | ing |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------+

To launch the server, I use the following command:

docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:24.01-py3 tritonserver --model-repository=/models

These are my GPU specs:
[screenshot of GPU specs]

To Reproduce
I use a ResNet50 model; this is the configuration file I use:

      name: "resnet50"
      platform: "pytorch_libtorch"
      max_batch_size : 4
      input [
        {
          name: "input__0"
          data_type: TYPE_FP32
          dims: [ 3, 224, 224 ]
        }
      ]
      output [
        {
          name: "output__0"
          data_type: TYPE_FP32
          dims: [ 1000, 1, 1 ]
        }
      ]
      
      instance_group [
        {
          count: 4
          kind: KIND_GPU
          gpus: [ 0 ]
        }
      ]
      dynamic_batching {
      }

I send batches whose size is equal to max_batch_size, so I am sure that when two batches arrive at the same time the server has to dispatch them to two different instances of the model.
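As a sanity check, the loaded configuration can be queried to confirm that the instance_group, max_batch_size, and dynamic_batching settings actually took effect. This is a minimal sketch for that check, not part of the original experiment:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The HTTP client returns the loaded model configuration as a plain dict.
config = client.get_model_config("resnet50")
print(config.get("max_batch_size"))    # expected: 4
print(config.get("instance_group"))    # expected: one group with count 4, KIND_GPU, gpus [0]
print(config.get("dynamic_batching"))  # expected: present (default settings)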

To run the experiment, I use the following client script:

import numpy as np
import tritonclient.http as httpclient
from PIL import Image
from torchvision import transforms
from tritonclient.utils import triton_to_np_dtype
import time
import torch
# preprocessing function
def rn50_preprocess(img_path="img1.jpg"):
    img = Image.open(img_path)
    preprocess = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    return preprocess(img).numpy()


transformed_img = rn50_preprocess()
transformed_img = np.expand_dims(transformed_img, axis=0)
transformed_img = np.concatenate((transformed_img, transformed_img), axis=0)
transformed_img = np.concatenate((transformed_img, transformed_img), axis=0)  # repeat to reach max_batch_size (4)

# Setting up client
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
inputs.set_data_from_numpy(transformed_img, binary_data=True)

outputs = httpclient.InferRequestedOutput(
    "output__0", binary_data=True, class_count=1000
)
starting_time = time.time()
# Querying the server
for i in range(1000):
    start = time.time()
    results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
    end = time.time()
    print(f"Time taken for inference: {(end - start) * 1000:.1f} ms")

First, I instantiate just ONE (1) model on my GPU. Then I run my Python script and I receive a response in approximately 11 ms. When I run a second thread while the first one is still running, the response time increases, which makes sense since the server is receiving two requests and has only one instance to process them.

Then I repeat the experiment but instantiate more copies of the model on the GPU. With two instances, I expect the server to route each incoming request to whichever instance is free, so I anticipate a reduction in response time when sending requests from two threads. However, the average response time remains the same as with a single instance. I attached a plot of the results showing the response time for the first thread; the jump in response time corresponds to the start of the second thread. As you can see, the response time is the same regardless of the number of model instances, which does not make sense to me. What could be happening?

[plot: per-request response time of the first thread; the jump marks the start of the second thread, and the curve is the same for every instance count]

@decadance-dance

I think we faced a similar issue. #7075
