Simplify custom prometheus metrics with gRPC request metrics #278

Open
willgraf opened this issue Feb 27, 2020 · 2 comments
Labels
enhancement New feature or request

Comments


willgraf commented Feb 27, 2020

Is your feature request related to a problem? Please describe.
Our segmentation-consumer custom metric is fairly complicated. The metric tries to strike a balance between consumers and tf-serving pods in order to throttle the number of requests sent to any given tf-serving pod. If any tf-serving pod receives too much traffic, it can go down, which we want to avoid. If we could directly measure the number of requests per second each tf-serving pod is receiving, the metric could be much simpler and would more directly reflect the scaling goal.

Describe the solution you'd like
I would like to refactor our custom metrics to use the rate of gRPC API requests to tf-serving, instead of the consumer-to-pod ratio. Istio is a service mesh that is supposed to be able to measure gRPC API requests. Istio should be installed and integrated with Prometheus in order to get a better, more responsive scaling metric.
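As a rough sketch of what such a metric could look like in PromQL (assuming Istio's standard istio_requests_total telemetry is scraped by Prometheus, and assuming the workload is named tf-serving):

# total gRPC requests per second reaching the tf-serving workload (workload name is an assumption)
sum(rate(istio_requests_total{reporter="destination", destination_workload="tf-serving", request_protocol="grpc"}[1m]))

# the same rate broken out per pod, assuming Prometheus attaches a pod label when scraping the sidecars
sum by (pod) (rate(istio_requests_total{reporter="destination", destination_workload="tf-serving", request_protocol="grpc"}[1m]))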

Describe alternatives you've considered
The root cause is that the prometheus-adapter does not measure gRPC requests over the network. The goal is to have a Prometheus metric that measures the number (and size) of gRPC requests sent to the tf-serving pods. https://github.com/ynqa/tf_serving_exporter looks promising.
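As a sketch of what that could look like if the tf-serving monitoring endpoint is scraped directly (the metric names below are assumptions about what such an exporter would surface; TF Serving's monitoring endpoint exposes counters like :tensorflow:serving:request_count):

# number of requests per second (assumed counter from the TF Serving monitoring endpoint)
rate(:tensorflow:serving:request_count[1m])

# size of requests: input tensor bytes per second
rate(:tensorflow:core:graph_run_input_tensor_bytes_sum[1m])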

Additional context
This was looked into previously, but we had trouble getting requests routed over the Istio service mesh. Perhaps newer releases of Istio have resolved this.

willgraf commented:

I noticed a flag for cluster creation that may help with Istio:

--enable-intra-node-visibility

Some other flags that might be useful, but which I haven't looked into enough yet:

--enable-shielded-nodes?
--shielded-integrity-monitoring
--shielded-secure-boot
--enable-resource-consumption-metering?
--enable-network-egress-metering?
--enable-tpu

willgraf commented:

Some other half-baked ideas I had for scaling using Prometheus:

rate(:tensorflow:core:graph_run_input_tensor_bytes_count[1m])
rate(:tensorflow:core:graph_run_input_tensor_bytes_sum[1m])
redis_up (1 if Redis is available, else 0)
:tensorflow:core:graph_run_time_usecs / 1000000 (graph run time in seconds)
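These could also be combined into a single expression, for example (a rough sketch; it assumes graph_run_time_usecs behaves as a cumulative counter in microseconds and that redis_up comes from a Redis exporter):

# average fraction of wall time spent executing graphs across tf-serving pods,
# only defined while Redis reports as up
avg(rate(:tensorflow:core:graph_run_time_usecs[1m])) / 1000000
  and on() (redis_up == 1)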
