Intermittent csi-driver-spiffe failure: Unable to mount cert #42

Open
warrior-abhijit opened this issue Oct 10, 2023 · 1 comment

warrior-abhijit commented Oct 10, 2023
We have encountered intermittent issues where csi-driver-spiffe fails to mount the certificate into a pod.
The problem appears to occur more frequently after the csi-driver-spiffe pod restarts.
After such a restart, the driver seems to lose track of which pod certificates need to be renewed.
Interestingly, manually restarting the affected pod results in new certificates being mounted correctly.

We observed the following error messages in the csi-driver-spiffe log:

csi/manager "msg"="Failed to issue certificate, retrying after applying exponential backoff" "error"="waiting for request: certificaterequest.cert-manager.io \"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\" not found" "volume_id"="csi-xxxxxxxx"
.......
csi/driver "msg"="failed processing request" "error"="timed out waiting for the condition" "request"={}
"rpc_method"="/csi.v1.Node/NodePublishVolume"
.......
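
For reference, the first error suggests the driver is waiting on a CertificateRequest that no longer exists. A minimal way to cross-check this from the cluster is sketched below; the UID is the redacted placeholder from the log, and the namespace and label selector are assumptions about our installation:

    # Look for the CertificateRequest UID referenced in the csi-driver-spiffe error
    # (placeholder value copied from the redacted log line above).
    kubectl get certificaterequests.cert-manager.io -A | grep xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    # Check the csi-driver-spiffe pods on the affected node; namespace and label
    # selector are assumptions and may differ per installation.
    kubectl get pods -n cert-manager -l app.kubernetes.io/name=cert-manager-csi-driver-spiffe -o wide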

We have already reviewed a previously closed issue (cert-manager/csi-driver#78) and updated the CSI data directory, but this did not resolve the problem.
We are actively looking for workarounds to address this behavior. One potential solution we are considering is utilizing a liveness probe.
We are seeking guidance on how to further identify and potentially resolve this issue.
Any suggestions regarding additional information we can provide would be greatly appreciated.
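
For completeness, the manual workaround mentioned above is simply recreating the affected workload pod so that its controller schedules a replacement, which then gets a freshly issued certificate mounted; a liveness probe would ideally automate a similar restart when things get stuck. A sketch with placeholder names:

    # Placeholder names: delete the pod whose SPIFFE volume failed to mount so its
    # controller recreates it with a newly issued certificate.
    kubectl -n <app-namespace> delete pod <affected-pod>

    # Or roll the whole workload instead of a single pod.
    kubectl -n <app-namespace> rollout restart deployment/<affected-deployment>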

maelvls (Member) commented Oct 10, 2023

csi/driver "msg"="failed processing request" "error"="timed out waiting for the condition" "request"={} "rpc_method"="/csi.v1.Node/NodePublishVolume"

Something hangs in csi-driver-spiffe when processing the NodePublishVolume request coming from the node's kubelet. I'll look at csi-lib's code to see where that might be happening.

  • Would it be possible to use e.g. --set app.logLevel=5 in Helm to see more of the csi-driver-spiffe logs? (A rough command sketch for both suggestions follows this list.)

  • Although I'm not sure how much info we will get out of it, would there be a way to get this node's kubelet logs? For example, on EKS, you can find them in the node's /var/log/messages. The logs would look like this:

    Oct 10 15:39:08 spiffe-control-plane kubelet[184]: I1010 15:39:08.461940     184 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-f47pf\" (UniqueName: \"kubernetes.io/projected/d0946ec9-6327-481b-999d-a5d2b89904da-kube-api-access-f47pf\") pod \"my-csi-app-676cb86596-6r9bh\" (UID: \"d0946ec9-6327-481b-999d-a5d2b89904da\") " pod="sandbox/my-csi-app-676cb86596-6r9bh"
    Oct 10 15:39:08 spiffe-control-plane kubelet[184]: I1010 15:39:08.461984     184 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"spiffe\" (UniqueName: \"kubernetes.io/csi/d0946ec9-6327-481b-999d-a5d2b89904da-spiffe\") pod \"my-csi-app-676cb86596-6r9bh\" (UID: \"d0946ec9-6327-481b-999d-a5d2b89904da\") " pod="sandbox/my-csi-app-676cb86596-6r9bh"
    
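
For concreteness, here is a rough sketch of both suggestions; the Helm release, chart, repository and namespace names are assumptions and should be adapted to the actual installation, and <affected-pod-name> is a placeholder:

    # 1. Raise csi-driver-spiffe log verbosity via Helm (release/chart names assumed).
    helm upgrade csi-driver-spiffe jetstack/cert-manager-csi-driver-spiffe \
      --namespace cert-manager --reuse-values --set app.logLevel=5

    # 2. On the affected node, pull the kubelet logs around the failing mount.
    #    EKS AMIs log the kubelet to /var/log/messages; systemd-based nodes can use journalctl.
    grep kubelet /var/log/messages | grep <affected-pod-name>
    journalctl -u kubelet --since "1 hour ago" | grep <affected-pod-name>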
