Simplify custom prometheus metrics with gRPC request metrics #278

Open
willgraf opened this issue Feb 27, 2020 · 2 comments
Labels
enhancement New feature or request

Comments


willgraf commented Feb 27, 2020

Is your feature request related to a problem? Please describe.
Our segmentation-consumer custom metric is fairly complicated. The metric tries to strike a balance between consumers and tf-serving pods in order to throttle the number of requests sent to any given tf-serving pod. If any tf-serving pod receives too much traffic, it can go down, which we want to avoid. If we could directly measure the number of requests per second each tf-serving pod is receiving, the metric could be much simpler and would more directly reflect the scaling goal.

Describe the solution you'd like
I would like to refactor our custom metrics to use the rate of gRPC API requests to tf-serving, instead of the consumer-to-pod ratio. Istio is a service mesh that is supposed to be able to measure gRPC API requests. Istio should be installed and integrated with Prometheus in order to get a better, more responsive scaling metric.
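As a rough sketch of what such a metric could look like in PromQL (assuming Istio's standard istio_requests_total telemetry is scraped by Prometheus, and assuming the workload is named tf-serving):

# total gRPC requests per second reaching the tf-serving workload (workload name is an assumption)
sum(rate(istio_requests_total{reporter="destination", destination_workload="tf-serving", request_protocol="grpc"}[1m]))

# the same rate broken out per pod, assuming Prometheus attaches a pod label when scraping the sidecars
sum by (pod) (rate(istio_requests_total{reporter="destination", destination_workload="tf-serving", request_protocol="grpc"}[1m]))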

Describe alternatives you've considered
The root cause is that the prometheus-adapter does not measure gRPC requests over the network. The goal is to have a Prometheus metric that measures the number (and size) of gRPC requests sent to the tf-serving pods. https://github.com/ynqa/tf_serving_exporter looks promising.
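As a sketch of what that could look like if the tf-serving monitoring endpoint is scraped directly (the metric names below are assumptions about what such an exporter would surface; TF Serving's monitoring endpoint exposes counters like :tensorflow:serving:request_count):

# number of requests per second (assumed counter from the TF Serving monitoring endpoint)
rate(:tensorflow:serving:request_count[1m])

# size of requests: input tensor bytes per second
rate(:tensorflow:core:graph_run_input_tensor_bytes_sum[1m])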

Additional context
This was looked into previously, but we had trouble getting requests routed over the Istio service mesh. Perhaps newer releases of Istio have resolved this.

willgraf commented:

I noticed a flag for cluster creation that may help with Istio:

--enable-intra-node-visibility

Some other flags that might be useful, but which I haven't looked into enough yet:

--enable-shielded-nodes?
--shielded-integrity-monitoring
--shielded-secure-boot
--enable-resource-consumption-metering?
--enable-network-egress-metering?
--enable-tpu

willgraf commented:

Some other half-baked ideas I had for scaling using Prometheus:

rate(:tensorflow:core:graph_run_input_tensor_bytes_count[1m])
rate(:tensorflow:core:graph_run_input_tensor_bytes_sum[1m])
redis_up (1 if Redis is available, else 0)
:tensorflow:core:graph_run_time_usecs / 1000000 (graph run time in seconds)
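These could also be combined into a single expression, for example (a rough sketch; it assumes graph_run_time_usecs behaves as a cumulative counter in microseconds and that redis_up comes from a Redis exporter):

# average fraction of wall time spent executing graphs across tf-serving pods,
# only defined while Redis reports as up
avg(rate(:tensorflow:core:graph_run_time_usecs[1m])) / 1000000
  and on() (redis_up == 1)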
