[BUG] race condition in longhorn-manager certificate renewal #8433

nazarewk · 2024-04-24T14:50:13Z

Describe the bug

We are observing what looks like a race condition between longhorn-manager pods. All longhorn-manager pods keep trying to update the longhorn-webhook-tls secret tens to hundreds of times per second, which fills the logs with errors like this:

time="2024-04-23T18:38:02Z" level=error msg="Failed to save TLS secret for longhorn-system/longhorn-webhook-tls: Operation cannot be fulfilled on secrets \"longhorn-webhook-tls\": the object has been modified; please apply your changes to the latest version and try again" func="kubernetes.(*storage).Update.func1" file="controller.go:236"

What we have observed (by running kubectl watch on the secret and decoding the certificates) is that the updates flip between 2 certificates (always the same 2) that differ only in their serial numbers.
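
For context, the error text above is the standard Kubernetes conflict response to a write based on a stale resourceVersion. Below is a toy Python sketch of that mechanism, assuming two managers read the secret at the same version and then both write. `SecretStore` and `Conflict` are made-up names for illustration, not Longhorn or Kubernetes client code.

```python
class Conflict(Exception):
    """Raised when a write is based on a stale resource version."""


class SecretStore:
    """Toy stand-in for how the API server guards a single Secret."""

    def __init__(self, data):
        self.data = data
        self.resource_version = 1

    def get(self):
        # Return the current data together with the version it was read at.
        return self.data, self.resource_version

    def update(self, data, resource_version):
        # Reject the write if the object changed since the caller read it.
        if resource_version != self.resource_version:
            raise Conflict("the object has been modified")
        self.data = data
        self.resource_version += 1


store = SecretStore("cert-serial-0")

# Two managers read the same version of the secret...
_, seen_by_a = store.get()
_, seen_by_b = store.get()

# ...and each tries to write its own freshly generated certificate.
store.update("cert-serial-A", seen_by_a)  # first write wins
try:
    store.update("cert-serial-B", seen_by_b)  # second write is stale
    outcome = "updated"
except Conflict:
    outcome = "conflict"

print(outcome, store.data)
```

If every manager reacts to the conflict by immediately re-reading and writing its own certificate again, the secret would keep flipping between the competing values, which matches what we see.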

Might be related to renewal 90 days before expiry.

To Reproduce

not sure how to reproduce, as we are not 100% sure of the cause

Expected behavior

- ideally, the longhorn-manager pod renews the certificate properly
- acceptably, the manager aborts the certificate update once it sees that the new certificate does not differ meaningfully from the existing one
- at worst, the certificate update is retried indefinitely, but less than once per second
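
A rough Python sketch of the last two points, to make the request concrete. It assumes a certificate can be represented as a dict of already parsed fields. The field names, `needs_update`, and `backoff_delays` are hypothetical and do not come from the Longhorn code base.

```python
import itertools

# Hypothetical field names. A real check would read them from the parsed
# X.509 certificate. The serial number is deliberately not compared.
COMPARED_FIELDS = ("subject", "issuer", "dns_names", "not_after")


def needs_update(current, candidate):
    """Return True only if the candidate differs in one of the compared fields."""
    return any(current.get(f) != candidate.get(f) for f in COMPARED_FIELDS)


def backoff_delays(base=1.0, cap=60.0):
    """Yield exponentially growing retry delays in seconds, capped at `cap`."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


existing = {"serial": "01", "subject": "longhorn", "issuer": "longhorn-ca",
            "dns_names": ["longhorn-admission-webhook"], "not_after": "2025-04-23"}
same_but_serial = dict(existing, serial="02")
renewed = dict(existing, serial="03", not_after="2026-04-23")

print(needs_update(existing, same_but_serial))  # only the serial differs
print(needs_update(existing, renewed))          # the expiry date differs
print(list(itertools.islice(backoff_delays(), 4)))
```

With a check like this, a manager that loses the write race could skip the update when the stored certificate is already equivalent, and otherwise wait at least a second before trying again.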

Support bundle for troubleshooting

Environment

Longhorn version: 1.5.4
Impacted volume (PV):
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubespray
- Number of control plane nodes in the cluster: 3
- Number of worker nodes in the cluster: 16
Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
Number of Longhorn volumes in the cluster:

Additional context

related to #5571

slack thread: just my notes without reply


innobead · 2024-04-24T14:59:38Z

cc @ChanYiLin

nazarewk added the kind/bug, require/backport, and require/qa-review-coverage labels on Apr 24, 2024
