
[BUG] New replicas are not being created and stuck at 'stopped' state on one of the longhorn nodes #8535

Closed
batulziiy opened this issue May 9, 2024 · 1 comment
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

batulziiy commented May 9, 2024

Describe the bug

I found that new replicas are not being created on one of the Longhorn nodes; whenever a replica is scheduled there, it gets stuck in the 'stopped' state. Around 1.5 months ago, some of the Longhorn nodes were restarted, and that could be the reason it is behaving this way. The rest of the nodes work normally and replicas are distributed across them without any issues. I checked the replica, instance-manager, and manager logs, and found something interesting:
time="2024-05-09T12:23:45Z" level=debug msg="Replica rebuildings for map[pvc1:{} pvc2:{} pvc3:{} pvc4:{} pvc5:{}] are in progress on this node, which reaches or exceeds the concurrent limit value 5" controller=longhorn-replica dataPath= node=node3 nodeID=node3 ownerID=node3 replica=replica-id

As stated above, the concurrent rebuild limit has reached its maximum value; however, only one replica is actually rebuilding at the moment. Moreover, the replicas (pvc1-5) mentioned in the log do not appear in the Longhorn UI, but they do exist physically on disk. Could that be the root cause? Should I delete them? I would appreciate your help.
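The mismatch described above (replica data on disk with no corresponding replica in the UI) can be checked from the CLI. A rough sketch, assuming Longhorn runs in the `longhorn-system` namespace, the default data path `/var/lib/longhorn` on the affected node, and that the Replica CRD exposes `spec.dataDirectoryName` in your version (verify against `kubectl explain replicas.longhorn.io.spec`):

```shell
# Replica directory names Longhorn currently knows about (cluster-wide)
kubectl -n longhorn-system get replicas.longhorn.io \
  -o jsonpath='{range .items[*]}{.spec.dataDirectoryName}{"\n"}{end}' | sort > known-replicas.txt

# Replica directories physically present on this node (run on node3)
ls /var/lib/longhorn/replicas | sort > on-disk-replicas.txt

# Directories on disk with no matching replica object: orphan candidates
comm -13 known-replicas.txt on-disk-replicas.txt
```

Any directory listed by the last command is an orphan candidate; confirm it is truly unreferenced before removing anything.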

To Reproduce

This situation is difficult to reproduce, I'm afraid.

Expected behavior

New replicas of a volume should be created on the node without getting stuck.

Support bundle for troubleshooting

I will send the support bundle to longhorn-support-bundle@suse.com, including the issue ID in the subject line.

Environment

  • Longhorn version: 1.4.0
  • Impacted volume (PV): pvc-453745e6-7869-4980-b5a6-52533d7a884a
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s version v1.25.4+k3s1
    • Number of control plane nodes in the cluster: 5
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: RHEL 9.1
    • Kernel version: 5.15.0-5.76.5.1.el9uek.x86_64
    • CPU per node: 2
    • Memory per node: 128GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes (Gbps): 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 52

Additional context

Only node3 is having this issue; the rest of the nodes are working normally.

batulziiy (Author) commented

After raising the 'Concurrent Replica Rebuild Per Node Limit' setting from 5 to 8, Longhorn was able to rebuild the replicas that were stuck. But I still think the underlying issue exists, and I am planning to reboot the machine in the near future.
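The setting change described here can also be made from the CLI instead of the UI. A hedged sketch, assuming Longhorn is installed in the `longhorn-system` namespace; the setting resource name `concurrent-replica-rebuild-per-node-limit` corresponds to the UI option mentioned above:

```shell
# Inspect the current limit (Longhorn settings are CRD-backed resources)
kubectl -n longhorn-system get settings.longhorn.io \
  concurrent-replica-rebuild-per-node-limit -o jsonpath='{.value}'

# Raise the limit from 5 to 8, matching the change that unblocked the rebuilds here
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-replica-rebuild-per-node-limit --type merge -p '{"value":"8"}'

# Check for replicas still stuck on the affected node
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep node3
```

Note that raising the limit works around the symptom; the stale rebuild entries counted against the limit may still warrant investigation.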
