
1 GPU clusters can get stuck in a DEADLINE_EXCEEDED loop. #338

Open
willgraf opened this issue May 5, 2020 · 1 comment

Labels
bug Something isn't working

Comments

willgraf (Contributor) commented May 5, 2020

Describe the bug
Sometimes clusters with a single GPU can get stuck with too many consumers sending requests that take too long and are rejected with a DEADLINE_EXCEEDED error. This can happen when the GPU is at high utilization, which the Prometheus scaling rule accounts for. However, in cases with a single GPU, the cluster can get stuck with the GPU at 0% utilization while all requests still time out. I have not seen this in any cluster with > 1 GPU.
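
For context, the failure mode is the consumer's gRPC Predict call being cancelled once its deadline elapses. Below is a minimal sketch of such a call; the server address, model name, input tensor name, and timeout value are placeholders and not the kiosk's actual configuration.

```python
# Minimal sketch of a consumer-side Predict call that can fail with
# DEADLINE_EXCEEDED. The host/port, model name, input tensor name, and
# timeout are placeholders, not the kiosk's actual configuration.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('tf-serving:8500')  # assumed address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'segmentation'  # hypothetical model name
request.inputs['image'].CopyFrom(
    tf.make_tensor_proto(np.zeros((1, 256, 256, 1), dtype=np.float32)))

try:
    # If the server cannot respond within the deadline (GRPC_TIMEOUT),
    # the call is cancelled and raises DEADLINE_EXCEEDED.
    response = stub.Predict(request, timeout=30)
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        print('Request timed out: %s' % err.details())
    else:
        raise
```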

To Reproduce
I've seen this in 100k benchmarking runs, though it does not happen regularly.

Expected behavior
The consumers should scale down so that the GPU can start processing requests in a reasonable amount of time.

This may be fixed with a better backoff on the consumer side (see the sketch below), a more effective GRPC_TIMEOUT setting, or improvements to the scaling rule. It may also be resolved by the improved metrics discussed in #278.
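
As a rough sketch of the backoff idea (not the actual consumer implementation), a retry loop with capped exponential backoff and jitter would keep many consumers from re-sending timed-out requests in lockstep against a single GPU; the function and parameter names here are hypothetical.

```python
# Sketch of exponential backoff with jitter for DEADLINE_EXCEEDED retries.
# The predict() callable and the backoff parameters are hypothetical; the
# real consumer retry logic may differ.
import random
import time

import grpc


def predict_with_backoff(predict, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call `predict()` and retry on DEADLINE_EXCEEDED with capped, jittered backoff."""
    for attempt in range(max_retries):
        try:
            return predict()
        except grpc.RpcError as err:
            if err.code() != grpc.StatusCode.DEADLINE_EXCEEDED:
                raise  # only back off on timeouts
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay,
            # with jitter so consumers do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
    raise RuntimeError('Predict failed after %d timed-out attempts' % max_retries)
```

Something like this, paired with a tuned GRPC_TIMEOUT or a scaling rule that also watches the timeout rate, might cover the 0% utilization case described above.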

Screenshots
The HPA status where it can get stuck (tf-serving at 0 and segmentation-consumer at 1):
[Screenshot: HPA status, captured 2020-05-05 at 1:02 PM]

Additional context
The TensorFlow Serving logs had some unusual warnings that may or may not be related:

[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
2020-05-05 19:25:59.266451: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 268435456 exceeds 10% of system memory.
2020-05-05 19:25:59.386629: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 268435456 exceeds 10% of system memory.
2020-05-05 19:25:59.746314: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of system memory.
2020-05-05 19:26:00.081015: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of system memory.
2020-05-05 19:26:10.632290: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 268435456 exceeds 10% of system memory.
willgraf added the bug label on May 5, 2020
willgraf (Contributor, Author) commented May 6, 2020

It seems you can just redeploy tf-serving as a workaround:

helm delete tf-serving --purge ; helmfile -l name=tf-serving sync
