[Question] What is the recommended way to run Triton? #5981

Does each instance of InferenceServerClient reuse stubs / channels?

For the Python gRPC client, each instance of InferenceServerClient creates a new channel; it does not reuse an existing one.

I am serving multiple models per Triton container. Does each instance of InferenceServerClient hold a unique stub / channel / connection per model, or does it create a new stub / channel / connection per model on each request?

There is a single channel connection per InferenceServerClient. The same channel is used for all requests: all model infer requests, as well as non-infer requests, are pushed through the same channel.

We don't have a specific recommendation to follow. However, our C++ client library is more…

Answer selected by MatthieuToulemont
Category: Q&A
Labels: question (Further information is requested)
5 participants
This discussion was converted from issue #5978 on June 23, 2023 01:50.