[BUG] race condition in longhorn-manager certificate renewal #8433

Open
nazarewk opened this issue Apr 24, 2024 · 1 comment
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage

Comments

nazarewk commented Apr 24, 2024

Describe the bug

We are observing what looks like a race condition between longhorn-manager pods. All longhorn-manager pods keep trying to update the longhorn-webhook-tls secret tens to hundreds of times per second, which fills the logs with errors like this:

time="2024-04-23T18:38:02Z" level=error msg="Failed to save TLS secret for longhorn-system/longhorn-webhook-tls: Operation cannot be fulfilled on secrets \"longhorn-webhook-tls\": the object has been modified; please apply your changes to the latest version and try again" func="kubernetes.(*storage).Update.func1" file="controller.go:236"

What we have observed (by running kubectl watch on the secret and decoding the certificates) is that the updates keep flipping between 2 certificates (always the same 2) that differ only in their serial numbers.
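The comparison described above could be sketched roughly like this in Go. This is an illustration only, not Longhorn code: `parseCertPEM`, `sameIdentity` and `newDemoCertPEM` are hypothetical names, and treating "same subject, issuer and DNS SANs" as "differing only by serial number" is an assumption about which fields matter.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"math/big"
	"strings"
	"time"
)

// parseCertPEM decodes the first PEM block in data as an X.509 certificate.
func parseCertPEM(data []byte) (*x509.Certificate, error) {
	block, _ := pem.Decode(data)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no CERTIFICATE PEM block found")
	}
	return x509.ParseCertificate(block.Bytes)
}

// sameIdentity reports whether two certificates agree on subject, issuer and
// DNS SANs. The serial number is deliberately ignored (assumption: these are
// the fields that matter for the webhook certificate).
func sameIdentity(a, b *x509.Certificate) bool {
	return a.Subject.String() == b.Subject.String() &&
		a.Issuer.String() == b.Issuer.String() &&
		strings.Join(a.DNSNames, ",") == strings.Join(b.DNSNames, ",")
}

// newDemoCertPEM builds a throwaway self-signed certificate with the given
// serial number. Hypothetical helper, only here so the sketch can be tried
// without access to a cluster.
func newDemoCertPEM(serial int64) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "demo-webhook"},
		DNSNames:     []string{"demo-webhook.example.svc"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}
```

In practice the input to `parseCertPEM` would be the decoded certificate data from each observed revision of the secret; comparing `SerialNumber` on the two parsed certificates then shows whether the serial is the only thing that changed.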

This might be related to the renewal that happens 90 days before expiry.

To Reproduce

Not sure how to reproduce, as we are not 100% sure of the cause.

Expected behavior

  • ideally, a single longhorn-manager pod renews the certificate properly
  • acceptably, the manager aborts the certificate update once it sees that the new certificate does not differ meaningfully from the existing one
  • at worst, the manager keeps retrying the update indefinitely, but less than once per second

Support bundle for troubleshooting

Environment

  • Longhorn version: 1.5.4
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubespray
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 16
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
  • Number of Longhorn volumes in the cluster:

Additional context

related to #5571

Slack thread: only my own notes so far, no replies

@nazarewk nazarewk added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage labels Apr 24, 2024
innobead (Member) commented

cc @ChanYiLin
