
Request for Improved Metrics and Real-Time Concurrency Reporting in Triton Inference Server #7145

Open
hxer7963 opened this issue Apr 22, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@hxer7963

Is your feature request related to a problem? Please describe.
The current operational metrics provided by the Triton Inference Server's metrics endpoint are insufficient for our needs. Key load indicators, such as input token length, output token length, combined input and output token length, and the real-time concurrency of the TensorRT-LLM backend, are missing. This makes it difficult to assess the distribution of sequence lengths (seqLen) across online requests and to analyze real-time system load accurately.

Describe the solution you'd like
I would like to see enhancements to the Triton Inference Server's metrics endpoint to include the following:

  1. Reporting of input token length, output token length, and combined input and output token length.
  2. Real-time concurrency reporting for the TensorRT-LLM backend.

Additionally, I'm uncertain whether Triton Inference Server supports reporting input_token_len from the preprocessing Python backend, and whether it allows aggregating output_token_len at the end of each request in the postprocessing Python backend. Similarly, I'm unsure whether the TensorRT-LLM backend supports exporting real-time concurrency to the metrics endpoint.
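
If the Python backend's custom-metrics API (pb_utils.MetricFamily / Metric) can be used for this, something like the sketch below is what I have in mind for the preprocessing model (with the same pattern for output_token_len in postprocessing). This is only a rough sketch under that assumption; the metric names, labels, and the INPUT_ID tensor name are placeholders from our own config, not existing Triton metrics:

```python
# Hypothetical sketch for the preprocessing model's model.py, assuming the
# Python backend custom-metrics API (pb_utils.MetricFamily) is available in
# our Triton version. Metric names, labels, and the INPUT_ID tensor name are
# placeholders, not existing Triton metrics.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Cumulative number of input tokens seen by the preprocessing model.
        self._input_tokens_family = pb_utils.MetricFamily(
            name="ensemble_input_token_length_total",
            description="Total input tokens processed by preprocessing",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self._input_tokens = self._input_tokens_family.Metric(
            labels={"model": "preprocessing"}
        )

        # Requests currently being processed by this model, as a rough
        # concurrency proxy (true backend concurrency would still need
        # support from the TensorRT-LLM backend itself).
        self._inflight_family = pb_utils.MetricFamily(
            name="ensemble_inflight_requests",
            description="Requests currently in flight in preprocessing",
            kind=pb_utils.MetricFamily.GAUGE,
        )
        self._inflight = self._inflight_family.Metric(
            labels={"model": "preprocessing"}
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            self._inflight.increment(1)

            input_ids = pb_utils.get_input_tensor_by_name(
                request, "INPUT_ID"
            ).as_numpy()
            # Record this request's input token length.
            self._input_tokens.increment(int(input_ids.size))

            # ... existing tokenization logic and response construction ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))

            self._inflight.increment(-1)
        return responses
```

These custom metrics would then show up on the same /metrics endpoint that Triton already exposes, which is exactly the integration point we are after.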

Describe alternatives you've considered
One alternative would be to implement custom logging and monitoring to track the desired metrics externally (a rough sketch of that workaround follows below). However, having these metrics integrated directly into the Triton Inference Server's metrics endpoint would streamline monitoring and analysis.
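
Concretely, the external workaround we have in mind is a small sidecar exporter built on the prometheus_client package, fed by our own client wrapper around Triton requests. A rough sketch; the metric names and the record_request / infer_with_metrics hooks are hypothetical parts of our own tooling, not Triton APIs:

```python
# Hypothetical sidecar exporter; metric names and the hooks below are
# placeholders for our own client-side instrumentation.
import time

from prometheus_client import Counter, Gauge, start_http_server

input_tokens = Counter("llm_input_tokens_total", "Total input tokens observed")
output_tokens = Counter("llm_output_tokens_total", "Total output tokens observed")
inflight = Gauge("llm_inflight_requests", "Requests currently in flight")


def record_request(input_len: int, output_len: int) -> None:
    """Called by our client wrapper after each Triton request completes."""
    input_tokens.inc(input_len)
    output_tokens.inc(output_len)


def infer_with_metrics(client, *args, **kwargs):
    # Wrap the Triton client call so in-flight requests show up as a gauge.
    with inflight.track_inprogress():
        return client.infer(*args, **kwargs)


if __name__ == "__main__":
    # Expose /metrics on a separate port, independent of Triton's endpoint.
    start_http_server(9400)
    while True:
        time.sleep(60)
```

This works, but it duplicates bookkeeping that the server already does internally and only observes what the client sees, which is why server-side metrics would be preferable.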

Additional context
We have deployed the Triton Inference Server with the TensorRT-LLM backend in our production environment, using an ensemble pipeline. The pipeline consists of a preprocessing stage with the tokenizer, the TensorRT-LLM generation stage with truncation, and the detokenization stage.

@jbkyang-nvi added the enhancement (New feature or request) label on Apr 30, 2024