
Istio-csr pods were hung and unable to handle requests, causing entire cluster downtime for new/expired pods #141

Open
anannaya opened this issue Mar 25, 2022 · 5 comments



anannaya commented Mar 25, 2022

We are running cert-manager-istio-csr with 2 pods.

One pod was hung at:

2022-03-24T18:59:47.311227Z	info	klog	tls-provider "msg"="renewing serving certificate"  
2022-03-24T18:59:47.556644Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-24T18:59:47.556869Z	info	klog	tls-provider "msg"="fetched new serving certificate"  "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.556722Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.556896Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T10:59:47.185630849Z"
2022-03-25T06:54:05.842512Z	info	klog	manager/events "msg"="Normal"  "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"ConfigMap","namespace":"istio-system","name":"istio-csr","uid":"8771468c-c206-4b9d-bf89-11f4f55e75b6","apiVersion":"v1","resourceVersion":"732068782"} "reason"="LeaderElection"
2022-03-25T06:54:05.842772Z	info	klog	manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{}}}
2022-03-25T06:54:05.842630Z	info	klog	manager/controller/configmap "msg"="Starting EventSource" "reconciler group"="" "reconciler kind"="ConfigMap" "source"={"Type":{"kind":"ConfigMap","apiVersion":"v1","metadata":{"creationTimestamp":null}}}
2022-03-25T06:54:05.842545Z	info	klog	manager/events "msg"="Normal"  "message"="cert-manager-istio-csr-689dfdb989-9k9wj_113e526f-7327-43f1-a2c2-81715b0a5573 became leader" "object"={"kind":"Lease","namespace":"istio-system","name":"istio-csr","uid":"b6ba9c83-a619-43bb-8799-05a45f92781a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"732068783"} "reason"="LeaderElection"
2022-03-25T06:54:05.842346Z	info	klog	successfully acquired lease istio-system/istio-csr

And the other pod (the actual leader) was hung at:

2022-03-24T18:59:47.312223Z	info	klog	tls-provider "msg"="renewing serving certificate"
2022-03-24T18:59:47.760825Z	info	klog	tls-provider "msg"="fetched new serving certificate"  "expiry-time"="2022-03-25T18:59:47Z"
2022-03-24T18:59:47.760845Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T10:59:47.253613531Z"
2022-03-24T18:59:47.760693Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-24T18:59:47.760627Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-25T03:03:09.770135Z	error	klog	error retrieving resource lock istio-system/istio-csr: etcdserver: request timed out

Due to this, no CSR requests were handled, so all the pods were unable to come up.
Pod error:

{"level":"warn","time":"2022-03-25T06:50:28.724445Z","scope":"ca","msg":"ca request failed, starting attempt 1 in 101.313121ms"}
{"level":"warn","time":"2022-03-25T06:50:28.826778Z","scope":"ca","msg":"ca request failed, starting attempt 2 in 218.728544ms"}
{"level":"warn","time":"2022-03-25T06:50:29.046226Z","scope":"ca","msg":"ca request failed, starting attempt 3 in 365.315547ms"}
{"level":"warn","time":"2022-03-25T06:50:29.411679Z","scope":"ca","msg":"ca request failed, starting attempt 4 in 834.988896ms"}
{"level":"warn","time":"2022-03-25T06:50:30.247763Z","scope":"sds","msg":"failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-03-25T06:50:28Z is after 2022-03-25T03:59:47Z\""}

anannaya commented Mar 25, 2022

So for almost 4 hours istio-csr was not serving any requests. A manual rolling restart of the deployment solved the issue.
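
A rolling restart of that kind can be done with something like the following (a sketch, assuming the deployment is named cert-manager-istio-csr, as the pod names in the logs suggest):

kubectl -n istio-system rollout restart deployment cert-manager-istio-csr
# wait for the replacement pods to become ready
kubectl -n istio-system rollout status deployment cert-manager-istio-csr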

@anannaya

After the rolling restart, new error messages started to appear:

k logs -n istio-system cert-manager-istio-csr-5858ff84b-hpvwd
2022-03-25T06:53:56.765755Z	info	klog	 "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.765787Z	info	klog	 "msg"="Using root CAs from file: /var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.765799Z	info	klog	 "msg"="-----------------------------------------------------------------"
2022-03-25T06:53:56.795849Z	info	klog	attempting to acquire leader lease istio-system/istio-csr...
2022-03-25T06:53:56.795927Z	info	klog	tls-provider/root-ca-watcher "msg"="loading root CAs bundle" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:56.795987Z	info	klog	manager "msg"="starting metrics server"  "path"="/metrics"
2022-03-25T06:53:56.796043Z	info	klog	tls-provider/root-ca-watcher "msg"="updating root CAs from file" "file"="/var/run/secrets/istio-csr/tls.crt"
2022-03-25T06:53:57.180152Z	info	klog	tls-provider "msg"="serving certificate ready"
2022-03-25T06:53:57.180217Z	info	spiffe	Added 1 certs to trust domain cluster.local in peer cert verifier
2022-03-25T06:53:57.180369Z	info	klog	tls-provider "msg"="fetched initial serving certificate"
2022-03-25T06:53:57.180411Z	info	klog	tls-provider "msg"="waiting to renew certificate"  "renewal-time"="2022-03-25T22:53:57.060128122Z"
2022-03-25T06:53:57.297049Z	info	klog	grpc-server "msg"="grpc serving" "serving-addr"="0.0.0.0:6443" "address"="[::]:6443"
2022-03-25T06:54:19.315736Z	error	klog	Failed to update lock: Operation cannot be fulfilled on configmaps "istio-csr": the object has been modified; please apply your changes to the latest version and try again
2022-03-25T09:23:41.509654Z	info	klog	Waited for 1.183671823s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1/namespaces/istio-system/certificaterequests/istio-csr-sdtmf
2022-03-25T10:10:02.886473Z	error	klog	grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"
2022-03-25T10:10:02.990037Z	error	klog	grpc-server "msg"="failed to authenticate request" "error"="failed to validate the JWT from cluster \"loadus1-eks\": the service account authentication returns an error: [invalid bearer token, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /etc/kubernetes/key-server/pipe.sock: connect: no such file or directory\", Post \"https://127.0.0.1:21362/authenticate?timeout=30s\": dial tcp 127.0.0.1:21362: connect: connection refused]" "serving-addr"="0.0.0.0:6443"


JoshVanL commented Mar 25, 2022

Hi @anannaya,

Can you please share the istio-csr version and configuration you used? Sharing your Kubernetes and Istio versions will also help.
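
For example, something like this would capture it (the Helm release name and deployment name here are assumptions):

helm -n istio-system get values cert-manager-istio-csr
kubectl -n istio-system get deployment cert-manager-istio-csr -o yaml
kubectl version --short
istioctl version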

One pod was hung at

Can you please set the logging to debug (--v=5)? It is not immediately clear to me that istio-csr is actually hanging here.
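
For example, the flag could be added to the istio-csr container args (a sketch; the container name and args layout are assumptions about your Deployment, e.g. edited via kubectl -n istio-system edit deployment cert-manager-istio-csr):

spec:
  template:
    spec:
      containers:
        - name: cert-manager-istio-csr   # assumed container name
          args:
            - --v=5   # append to the existing args to raise verbosity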

etcdserver: request timed out

This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What do the API server logs show?


anannaya commented Mar 25, 2022

istio-csr version --> 0.3.0
Kubernetes version --> v1.19.13-eks-8df270

This suggests that your etcd is under-resourced. Are there other system components experiencing the same errors? What do the API server logs show?
--> We use AWS EKS. I totally agree there might be intermittent control-plane network issues, but shouldn't there be a retry? Or the other istio-csr pod should take over. kubectl commands were working, and metrics and everything else were working as usual.

For now, I have upgraded istio-csr to 0.4.0 with log level 5.
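
Roughly like this via Helm (a sketch; the release name, chart version string, and the app.logLevel values key are assumptions about the jetstack/cert-manager-istio-csr chart):

helm upgrade cert-manager-istio-csr jetstack/cert-manager-istio-csr \
  --namespace istio-system \
  --version v0.4.0 \
  --set app.logLevel=5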

Is istio-csr used in production by anyone yet?


ethompsy commented Aug 9, 2022

Well, we were thinking about putting it in production until we saw this post. Was there any further resolution since March?
