
Failing liveness probes can cause service degradation #14980

Open
ReToCode opened this issue Mar 5, 2024 · 4 comments
Labels
area/networking · kind/bug (Categorizes issue or PR as related to a bug.)

Comments

ReToCode (Member) commented Mar 5, 2024

Description

Users are allowed to define a livenessProbe on their container (for now just on the main one, with #14853 also on all sidecars). If the user defines just a livenessProbe without also defining the same check as a readinessProbe, the following situation can occur:

  • The liveness probe starts failing while the readiness probe (or our default TCP socket check) stays ok
  • K8s will restart the affected container (not the entire pod)
  • Knative (Queue-Proxy) does not know about this situation and keeps forwarding requests to the local user-container port
  • The user sees an error (see below)
  • The situation resolves itself as soon as the restarted container is up and healthy again
# logs in QP
# Queue-Proxy tries to forward requests, but will error out:
queue-proxy {"severity":"ERROR","timestamp":"2024-01-18T14:36:05.753218441Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 8\nTCP: inuse 0 orphan 0 tw 15 alloc 185 mem 0\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"d96dabb-dirty","knative.dev/key":"default/runtime-00001","knative.dev/pod":"runtime-00001-deployment-7b9c49d676-dlxmt","error":"dial tcp 127.0.0.1:8080: connect: connection refused","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20240115132401-f95090a164db/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/", "requestSize": "0", "status": 502, "responseSize": "53", "userAgent": "curl/8.4.0", "remoteIp": "10.42.0.20:52782", "serverIp": "10.42.0.34", "referer": "", "latency": "0.001448584s", "protocol": "HTTP/1.1"}, "traceId": "]"}
queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/", "requestSize": "0", "status": 502, "responseSize": "53", "userAgent": "curl/8.4.0", "remoteIp": "10.42.0.20:52782", "serverIp": "10.42.0.34", "referer": "", "latency": "0.000335542s", "protocol": "HTTP/1.1"}, "traceId": "]"}
queue-proxy {"severity":"ERROR","timestamp":"2024-01-18T14:36:06.781777885Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 8\nTCP: inuse 0 orphan 0 tw 15 alloc 185 mem 0\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"d96dabb-dirty","knative.dev/key":"default/runtime-00001","knative.dev/pod":"runtime-00001-deployment-7b9c49d676-dlxmt","error":"dial tcp 127.0.0.1:8080: connect: connection refused","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20240115132401-f95090a164db/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/", "requestSize": "0", "status": 502, "responseSize": "53", "userAgent": "curl/8.4.0", "remoteIp": "10.42.0.20:52782", "serverIp": "10.42.0.34", "referer": "", "latency": "0.00029575s", "protocol": "HTTP/1.1"}, "traceId": "]"}
queue-proxy {"severity":"ERROR","timestamp":"2024-01-18T14:36:07.815057125Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 8\nTCP: inuse 0 orphan 0 tw 15 alloc 185 mem 0\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"d96dabb-dirty","knative.dev/key":"default/runtime-00001","knative.dev/pod":"runtime-00001-deployment-7b9c49d676-dlxmt","error":"dial tcp 127.0.0.1:8080: connect: connection refused","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20240115132401-f95090a164db/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/", "requestSize": "0", "status": 502, "responseSize": "53", "userAgent": "curl/8.4.0", "remoteIp": "10.42.0.20:52782", "serverIp": "10.42.0.34", "referer": "", "latency": "0.000280208s", "protocol": "HTTP/1.1"}, "traceId": "]"}
queue-proxy {"severity":"ERROR","timestamp":"2024-01-18T14:36:08.847556037Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 8\nTCP: inuse 0 orphan 0 tw 15 alloc 185 mem 0\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"d96dabb-dirty","knative.dev/key":"default/runtime-00001","knative.dev/pod":"runtime-00001-deployment-7b9c49d676-dlxmt","error":"dial tcp 127.0.0.1:8080: connect: connection refused","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20240115132401-f95090a164db/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
queue-proxy {"severity":"ERROR","timestamp":"2024-01-18T14:36:09.897340832Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 8\nTCP: inuse 0 orphan 0 tw 15 alloc 185 mem 0\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"d96dabb-dirty","knative.dev/key":"default/runtime-00001","knative.dev/pod":"runtime-00001-deployment-7b9c49d676-dlxmt","error":"dial tcp 127.0.0.1:8080: connect: connection refused","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20240115132401-f95090a164db/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
# The user sees
* Connection #0 to host runtime.default.172.17.0.100.sslip.io left intact
*   Trying 172.17.0.100:80...
* Connected to runtime.default.172.17.0.100.sslip.io (172.17.0.100) port 80
> GET / HTTP/1.1
> Host: runtime.default.172.17.0.100.sslip.io
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
HTTP/1.1 502 Bad Gateway
< content-length: 53
content-length: 53
< content-type: text/plain; charset=utf-8
content-type: text/plain; charset=utf-8
< date: Thu, 18 Jan 2024 14:36:09 GMT
date: Thu, 18 Jan 2024 14:36:09 GMT
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-envoy-upstream-service-time: 1
x-envoy-upstream-service-time: 1
< server: envoy
server: envoy

<
dial tcp 127.0.0.1:8080: connect: connection refused
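
For reference, a minimal Service spec that can get into this state might look like the sketch below. The image name and the /healthz path are hypothetical; the only relevant part is that a livenessProbe is defined without a matching readinessProbe.

# Hypothetical reproduction sketch (not from the issue): a Service that only
# defines a livenessProbe, without a readinessProbe performing the same check.
# Image name and /healthz path are made up for illustration.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: runtime
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: example.com/runtime:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz      # starts failing while the TCP socket stays open
            periodSeconds: 1
            failureThreshold: 3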
ReToCode added the kind/bug and area/networking labels on Mar 5, 2024
ckcd (Contributor) commented Mar 8, 2024

@ReToCode when K8s restarts the affected user container, QP's readinessProbe check should also fail because it inherits the original readinessProbe configuration of the user container, so traffic should not go to this pod, right? That is why I was wondering how the situation you describe can happen. Please correct me if I'm missing something.

ReToCode (Member, Author) commented Mar 8, 2024

Yes, this is correct. When your readiness probe also fails, this is not an issue. It only happens when the liveness and readiness probes are two different probes: the liveness probe fails and K8s restarts the specific container (not the Pod), so the readiness probe could still be healthy or take longer to also start failing. During that short period until the container is back up, traffic can error out.
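
As a hedged sketch of what that means in the Service spec (probe path and values are illustrative, not a recommendation from this issue), mirroring the liveness check as a readinessProbe lets the pod drop out of the endpoints at roughly the same time the container is restarted:

# Sketch of the mitigation: mirror the liveness check as a readinessProbe so
# the pod is removed from the endpoints while the container restarts.
# Path and values are illustrative.
      containers:
        - image: example.com/runtime:latest
          livenessProbe:
            httpGet:
              path: /healthz
            periodSeconds: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz      # same check as the liveness probe
            periodSeconds: 1
            failureThreshold: 3

As noted above, once the readiness probe fails as well, traffic is no longer routed to the pod and the errors disappear.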

ckcd (Contributor) commented Mar 8, 2024

@ReToCode Ok, I get your point. So perhaps we need some deeper handling for the livenessProbe, just like what we do for the readinessProbe?

mgencur (Contributor) commented May 3, 2024

I suppose this problem can also occur when the livenessProbe has a different periodSeconds and/or failureThreshold from the readinessProbe. In this case, the readiness probe might spot a failure at a different time and not remove the Endpoints from the Service quickly enough.
For sidecars, the readiness probe periodSeconds/failureThreshold are rewritten with the values from the user container. So in the end the sidecar liveness probes would have to have the same periodSeconds/failureThreshold as the user container's readinessProbe so as to minimize this type of error.
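
To make the timing concrete, a sketch with purely illustrative values:

# Illustrative timing only (numbers are not from the issue): with the values
# below, kubelet restarts the container after roughly 1s * 3 = 3s of liveness
# failures, while the readiness probe needs roughly 10s * 3 = 30s to mark the
# pod not-ready. That leaves a window of up to ~27s in which the endpoint is
# still in the Service while the container is being restarted.
          livenessProbe:
            httpGet:
              path: /healthz
            periodSeconds: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
            periodSeconds: 10
            failureThreshold: 3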
