
Istio-csr pods were hung and unable to handle requests, causing entire cluster downtime for new/expired pods #141

Open
anannaya opened this issue Mar 25, 2022 · 5 comments



anannaya commented Mar 25, 2022

We are running cert-manager-istio-csr with 2 pods.

One pod was hung at:

2022-03-24T18:59:47.311227Z	info	klog	tls-provider "msg"="renewing serving certificate"  
2022-03-24T18:59:47.556644Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-24T18:59:47.556869Z	info	klog	tls-provider "msg"="fetched new serving certificate"  "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.556722Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.556896Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T10:59:47.185630849Z"
2022-03-25T06:54:05.842512Z	info	klog	manager/events "msg"="Normal"  "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"ConfigMap","namespace":"istio-system","name":"istio-csr","uid":"8771468c-c206-4b9d-bf89-11f4f55e75b6","apiVersion":"v1","resourceVersion":"732068782"} "reason"="LeaderElection"
2022-03-25T06:54:05.842772Z	info	klog	manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{}}}
2022-03-25T06:54:05.842630Z	info	klog	manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"kind":"ConfigMap","apiVersion":"v1","metadata":{"creationTimestamp":null}}}
2022-03-25T06:54:05.842545Z	info	klog	manager/events "msg"="Normal"  "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"Lease","namespace":"istio-system","name":"istio-csr","uid":"b6ba9c83-a619-43bb-8799-05a45f92781a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"732068783"} "reason"="LeaderElection"
2022-03-25T06:54:05.842346Z	info	klog	successfully acquired lease istio-system/istio-csr

And the other pod (the actual leader) was hung at:

2022-03-24T18:59:47.312223Z	info	klog	tls-provider "msg"="renewing serving certificate"
2022-03-24T18:59:47.760825Z	info	klog	tls-provider "msg"="fetched new serving certificate"  "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.760845Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T10:59:47.253613531Z"
2022-03-24T18:59:47.760693Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.760627Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-25T03:03:09.770135Z	error	klog	error retrieving resource lock istio-system/istio-csr: etcdserver: request timed out

Due to this, no CSR requests were handled, so all the pods were unable to come up.
Pod error:

{"level":"warn","time":"2022-03-25T06:50:28.724445Z","scope":"ca","msg":"ca request failed, starting attempt 1 in 101.313121ms"}
{"level":"warn","time":"2022-03-25T06:50:28.826778Z","scope":"ca","msg":"ca request failed, starting attempt 2 in 218.728544ms"}
{"level":"warn","time":"2022-03-25T06:50:29.046226Z","scope":"ca","msg":"ca request failed, starting attempt 3 in 365.315547ms"}
{"level":"warn","time":"2022-03-25T06:50:29.411679Z","scope":"ca","msg":"ca request failed, starting attempt 4 in 834.988896ms"}
{"level":"warn","time":"2022-03-25T06:50:30.247763Z","scope":"sds","msg":"failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-03-25T06:50:28Z is after 2022-03-25T03:59:47Z\""}

anannaya commented Mar 25, 2022

So for almost 4 hours istio-csr was not serving any requests. A manual rolling restart of the deployment solved the issue.
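
A rolling restart of that kind can be done with something like the following (a sketch, assuming the deployment is named cert-manager-istio-csr, as the pod names in the logs suggest):

kubectl -n istio-system rollout restart deployment cert-manager-istio-csr
# wait for the replacement pods to become ready
kubectl -n istio-system rollout status deployment cert-manager-istio-csr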

@anannaya

After the rolling restart, new error messages started to appear:

k logs -n istio-system cert-manager-istio-csr-5858ff84b-hpvwd
2022-03-25T06:53:56.765755Z	info	klog	 "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.765787Z	info	klog	 "msg"="Using root CAs from file: /var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.765799Z	info	klog	 "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.795849Z	info	klog	attempting to acquire leader lease istio-system/istio-csr...
2022-03-25T06:53:56.795927Z	info	klog	tls-provider/root-ca-watcher "msg"="loading root CAs bundle" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.795987Z	info	klog	manager "msg"="starting metrics server"  "path"="/metrics"
2022-03-25T06:53:56.796043Z	info	klog	tls-provider/root-ca-watcher "msg"="updating root CAs from file" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:57.180152Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-25T06:53:57.180217Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-25T06:53:57.180369Z	info	klog	tls-provider "msg"="fetched initial serving certificate"
2022-03-25T06:53:57.180411Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T22:53:57.060128122Z"
2022-03-25T06:53:57.297049Z	info	klog	grpc-server "msg"="grpc serving" "serving-addr"="0.0.0.0:6443" "address"="[::]:6443"
2022-03-25T06:54:19.315736Z	error	klog	Failed to update lock: Operation cannot be fulfilled on configmaps "istio-csr": the object has been modified; please apply your changes to the latest version and try again
2022-03-25T09:23:41.509654Z	info	klog	Waited for 1.183671823s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1/namespaces/istio-system/certificaterequests/istio-csr-sdtmf
2022-03-25T10:10:02.886473Z	error	klog	grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"
2022-03-25T10:10:02.990037Z	error	klog	grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"


JoshVanL commented Mar 25, 2022

Hi @anannaya,

Can you please share the istio-csr version and configuration you used? Sharing your Kubernetes and Istio versions will also help.
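
For example, something like this would capture it (the Helm release name and deployment name here are assumptions):

helm -n istio-system get values cert-manager-istio-csr
kubectl -n istio-system get deployment cert-manager-istio-csr -o yaml
kubectl version --short
istioctl version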

One pod was hung at

Can you please set the logging to debug (--v=5)? It is not immediately clear to me that istio-csr is actually hanging here.
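
For example, the flag could be added to the istio-csr container args (a sketch; the container name and args layout are assumptions about your Deployment, e.g. edited via kubectl -n istio-system edit deployment cert-manager-istio-csr):

spec:
  template:
    spec:
      containers:
        - name: cert-manager-istio-csr   # assumed container name
          args:
            - --v=5   # append to the existing args to raise verbosity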

etcdserver: request timed out

This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What do the API server logs show?


anannaya commented Mar 25, 2022

istio-csr version --> 0.3.0
Kubernetes version --> v1.19.13-eks-8df270

This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What do the API server logs show?
--> We use AWS EKS. I totally agree there might be intermittent control-plane network issues, but shouldn't there be a retry? Or the other istio-csr pod should take over. kubectl commands were working, and metrics and everything else were working as usual.

For now, I have upgraded istio-csr to 0.4.0 with log level 5.
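
Roughly like this via Helm (a sketch; the release name, chart version string, and the app.logLevel values key are assumptions about the jetstack/cert-manager-istio-csr chart):

helm upgrade cert-manager-istio-csr jetstack/cert-manager-istio-csr \
  --namespace istio-system \
  --version v0.4.0 \
  --set app.logLevel=5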

Is istio-csr used in production by anyone yet?


ethompsy commented Aug 9, 2022

Well, we were thinking about putting it in production until we saw this post. Was there any further resolution since March?
