
Request for Improved Metrics and Real-Time Concurrency Reporting in Triton Inference Server #7145

Open
hxer7963 opened this issue Apr 22, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@hxer7963

Is your feature request related to a problem? Please describe.
The current operational metrics provided by the Triton Inference Server's metrics endpoint are insufficient for our needs. Key load indicators, such as input token length, output token length, combined input and output token length, and the real-time concurrency of the TensorRT-LLM backend, are missing. This makes it difficult to assess the distribution of sequence lengths (seqLen) across online requests and to analyze real-time system load accurately.

Describe the solution you'd like
I would like to see enhancements to the Triton Inference Server's metrics endpoint to include the following:

  1. Reporting of input token length, output token length, and combined input and output token length.
  2. Real-time concurrency reporting for the TensorRT-LLM backend.

Additionally, I'm uncertain whether Triton Inference Server supports reporting input_token_len from the preprocessing Python backend, and whether it allows aggregating output_token_len at the end of each request in the postprocessing Python backend. Similarly, I'm unsure whether the TensorRT-LLM backend supports exporting real-time concurrency to the metrics endpoint.
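
If the Python backend's custom-metrics API (pb_utils.MetricFamily / Metric) can be used for this, something like the sketch below is what I have in mind for the preprocessing model (with the same pattern for output_token_len in postprocessing). This is only a rough sketch under that assumption; the metric names, labels, and the INPUT_ID tensor name are placeholders from our own config, not existing Triton metrics:

```python
# Hypothetical sketch for the preprocessing model's model.py, assuming the
# Python backend custom-metrics API (pb_utils.MetricFamily) is available in
# our Triton version. Metric names, labels, and the INPUT_ID tensor name are
# placeholders, not existing Triton metrics.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Cumulative number of input tokens seen by the preprocessing model.
        self._input_tokens_family = pb_utils.MetricFamily(
            name="ensemble_input_token_length_total",
            description="Total input tokens processed by preprocessing",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self._input_tokens = self._input_tokens_family.Metric(
            labels={"model": "preprocessing"}
        )

        # Requests currently being processed by this model, as a rough
        # concurrency proxy (true backend concurrency would still need
        # support from the TensorRT-LLM backend itself).
        self._inflight_family = pb_utils.MetricFamily(
            name="ensemble_inflight_requests",
            description="Requests currently in flight in preprocessing",
            kind=pb_utils.MetricFamily.GAUGE,
        )
        self._inflight = self._inflight_family.Metric(
            labels={"model": "preprocessing"}
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            self._inflight.increment(1)

            input_ids = pb_utils.get_input_tensor_by_name(
                request, "INPUT_ID"
            ).as_numpy()
            # Record this request's input token length.
            self._input_tokens.increment(int(input_ids.size))

            # ... existing tokenization logic and response construction ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))

            self._inflight.increment(-1)
        return responses
```

These custom metrics would then show up on the same /metrics endpoint that Triton already exposes, which is exactly the integration point we are after.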

Describe alternatives you've considered
One alternative would be to implement custom logging and monitoring to track the desired metrics externally (a rough sketch of that workaround follows below). However, having these metrics integrated directly into the Triton Inference Server's metrics endpoint would streamline monitoring and analysis.
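
Concretely, the external workaround we have in mind is a small sidecar exporter built on the prometheus_client package, fed by our own client wrapper around Triton requests. A rough sketch; the metric names and the record_request / infer_with_metrics hooks are hypothetical parts of our own tooling, not Triton APIs:

```python
# Hypothetical sidecar exporter; metric names and the hooks below are
# placeholders for our own client-side instrumentation.
import time

from prometheus_client import Counter, Gauge, start_http_server

input_tokens = Counter("llm_input_tokens_total", "Total input tokens observed")
output_tokens = Counter("llm_output_tokens_total", "Total output tokens observed")
inflight = Gauge("llm_inflight_requests", "Requests currently in flight")


def record_request(input_len: int, output_len: int) -> None:
    """Called by our client wrapper after each Triton request completes."""
    input_tokens.inc(input_len)
    output_tokens.inc(output_len)


def infer_with_metrics(client, *args, **kwargs):
    # Wrap the Triton client call so in-flight requests show up as a gauge.
    with inflight.track_inprogress():
        return client.infer(*args, **kwargs)


if __name__ == "__main__":
    # Expose /metrics on a separate port, independent of Triton's endpoint.
    start_http_server(9400)
    while True:
        time.sleep(60)
```

This works, but it duplicates bookkeeping that the server already does internally and only observes what the client sees, which is why server-side metrics would be preferable.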

Additional context
We have deployed the Triton Inference Server with the TensorRT-LLM backend in our production environment, using an ensemble pipeline. The pipeline consists of a preprocessing stage with the tokenizer, the TensorRT-LLM generation stage with truncation, and the detokenization stage.

@jbkyang-nvi added the enhancement (New feature or request) label on Apr 30, 2024