Is your feature request related to a problem? Please describe.
The operational metrics currently exposed by the Triton Inference Server's metrics endpoint are insufficient for our needs. Key load indicators are missing: input token length, output token length, combined input and output token length, and the real-time concurrency of the TensorRT-LLM backend. Without them, it is hard to assess the distribution of sequence lengths (seqLen) across online requests or to analyze real-time system load accurately.
Describe the solution you'd like
I would like to see enhancements to the Triton Inference Server's metrics endpoint to include the following:
- Reporting of input token length, output token length, and combined input and output token length.
- Real-time concurrency reporting for the TensorRT-LLM backend.
Additionally, I'm uncertain whether Triton Inference Server supports reporting input_token_len in the preprocessing Python backend and whether it allows for the aggregation of output_token_len at the end of each request in the postprocessing Python backend. Similarly, I'm unsure if TensorRT LLM supports exporting real-time concurrency to the metrics endpoint.
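If the deployed Triton version supports the Python backend's custom-metrics API (`pb_utils.MetricFamily`), the per-request aggregation could in principle live in the preprocessing and postprocessing models themselves. The sketch below is an assumption, not a confirmed implementation: the metric names, labels, and the `TokenMetrics` helper are all hypothetical, and a small stub stands in for `pb_utils` so the aggregation logic can be exercised outside the server.

```python
# Hedged sketch: aggregating token lengths and in-flight concurrency via
# Triton's Python-backend custom-metrics API. Metric names are illustrative.
try:
    import triton_python_backend_utils as pb_utils  # only available inside Triton
except ImportError:
    # Minimal stand-in so the aggregation logic runs outside the server.
    class _Metric:
        def __init__(self):
            self.value = 0
        def increment(self, v):
            self.value += v
        def set(self, v):
            self.value = v

    class _MetricFamily:
        COUNTER, GAUGE = "counter", "gauge"
        def __init__(self, name, description, kind):
            self.name, self.kind = name, kind
        def Metric(self, labels=None):
            return _Metric()

    class pb_utils:
        MetricFamily = _MetricFamily


class TokenMetrics:
    """Aggregates input/output token lengths and requests in flight."""

    def __init__(self):
        self.input_tokens = pb_utils.MetricFamily(
            name="llm_input_token_length_total",       # hypothetical name
            description="Cumulative input token count",
            kind=pb_utils.MetricFamily.COUNTER,
        ).Metric(labels={"stage": "preprocessing"})
        self.output_tokens = pb_utils.MetricFamily(
            name="llm_output_token_length_total",      # hypothetical name
            description="Cumulative output token count",
            kind=pb_utils.MetricFamily.COUNTER,
        ).Metric(labels={"stage": "postprocessing"})
        self.concurrency = pb_utils.MetricFamily(
            name="llm_inflight_requests",              # hypothetical name
            description="Requests currently in flight",
            kind=pb_utils.MetricFamily.GAUGE,
        ).Metric(labels={"stage": "ensemble"})
        self._inflight = 0

    def request_started(self, input_token_len):
        # Called from the preprocessing model once the prompt is tokenized.
        self._inflight += 1
        self.concurrency.set(self._inflight)
        self.input_tokens.increment(input_token_len)

    def request_finished(self, output_token_len):
        # Called from the postprocessing model after the last output token.
        self._inflight -= 1
        self.concurrency.set(self._inflight)
        self.output_tokens.increment(output_token_len)
```

Under this assumption, the counters would show up on the existing `/metrics` endpoint alongside Triton's built-in metrics, so no extra scrape target would be needed.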
Describe alternatives you've considered
One alternative would be to implement custom logging and monitoring solutions to track the desired metrics externally. However, having these metrics integrated directly into the Triton Inference Server's metrics endpoint would streamline monitoring and analysis processes.
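As a rough illustration of that alternative, the external tracker could keep its own counters and render them in the Prometheus text exposition format for a sidecar scrape target. Everything below is a stdlib-only sketch; the `ExternalMetrics` class and metric names are hypothetical, not part of Triton.

```python
# Sketch of the external-monitoring alternative: application code tracks
# the desired counters itself and renders Prometheus text format, which a
# separate scraper can poll. Metric names are illustrative.
from threading import Lock


class ExternalMetrics:
    def __init__(self):
        self._lock = Lock()
        self._counters = {
            "input_token_length_total": 0,
            "output_token_length_total": 0,
        }
        self._inflight = 0

    def request_started(self):
        with self._lock:
            self._inflight += 1

    def request_finished(self):
        with self._lock:
            self._inflight -= 1

    def observe_request(self, input_len, output_len):
        # Record token lengths once a request has fully completed.
        with self._lock:
            self._counters["input_token_length_total"] += input_len
            self._counters["output_token_length_total"] += output_len

    def render(self):
        """Prometheus text format: a `# TYPE` line, then `name value`."""
        with self._lock:
            lines = [
                f"# TYPE {n} counter\n{n} {v}"
                for n, v in sorted(self._counters.items())
            ]
            lines.append(
                f"# TYPE inflight_requests gauge\ninflight_requests {self._inflight}"
            )
        return "\n".join(lines) + "\n"
```

The obvious downside, as noted above, is operating a second metrics endpoint and keeping its numbers consistent with what the server itself reports.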
Additional context
We have deployed the Triton Inference Server with the TensorRT-LLM backend in our production environment, using an ensemble pipeline: a preprocessing (tokenization) stage, the TensorRT-LLM generation stage (with truncation), and a postprocessing (detokenization) stage.