
[BUG] New replicas are not being created and stuck at 'stopped' state on one of the longhorn nodes #8535

Closed
batulziiy opened this issue May 9, 2024 · 1 comment
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

batulziiy commented May 9, 2024

Describe the bug

I found that new replicas are not being created on one of the Longhorn nodes; whenever a replica is scheduled there, it gets stuck in the 'stopped' state. Around 1.5 months ago, some of the Longhorn nodes were restarted, and that could be the reason it is behaving this way. The rest of the nodes work normally and replicas are distributed across them without any issues. I checked the replica, instance-manager, and manager logs, and found something interesting:
time="2024-05-09T12:23:45Z" level=debug msg="Replica rebuildings for map[pvc1:{} pvc2:{} pvc3:{} pvc4:{} pvc5:{}] are in progress on this node, which reaches or exceeds the concurrent limit value 5" controller=longhorn-replica dataPath= node=node3 nodeID=node3 ownerID=node3 replica=replica-id

As stated above, the concurrent rebuild limit has reached its maximum value; however, only one replica is actually rebuilding at the moment. Moreover, the replicas (pvc1-5) mentioned in the log do not appear in the Longhorn UI, but they do exist physically on disk. Could that be the root cause? Should I delete them? I would appreciate your help.
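The mismatch described above (replica data on disk with no corresponding replica in the UI) can be checked from the CLI. A rough sketch, assuming Longhorn runs in the `longhorn-system` namespace, the default data path `/var/lib/longhorn` on the affected node, and that the Replica CRD exposes `spec.dataDirectoryName` in your version (verify against `kubectl explain replicas.longhorn.io.spec`):

```shell
# Replica directory names Longhorn currently knows about (cluster-wide)
kubectl -n longhorn-system get replicas.longhorn.io \
  -o jsonpath='{range .items[*]}{.spec.dataDirectoryName}{"\n"}{end}' | sort > known-replicas.txt

# Replica directories physically present on this node (run on node3)
ls /var/lib/longhorn/replicas | sort > on-disk-replicas.txt

# Directories on disk with no matching replica object: orphan candidates
comm -13 known-replicas.txt on-disk-replicas.txt
```

Any directory listed by the last command is an orphan candidate; confirm it is truly unreferenced before removing anything.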

To Reproduce

This situation is difficult to reproduce, I'm afraid.

Expected behavior

New replicas of a volume should be created on the node without getting stuck.

Support bundle for troubleshooting

I will send the support bundle to longhorn-support-bundle@suse.com, including the issue ID in the subject line.

Environment

  • Longhorn version: 1.4.0
  • Impacted volume (PV): pvc-453745e6-7869-4980-b5a6-52533d7a884a
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s version v1.25.4+k3s1
    • Number of control plane nodes in the cluster: 5
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: RHEL 9.1
    • Kernel version: 5.15.0-5.76.5.1.el9uek.x86_64
    • CPU per node: 2
    • Memory per node: 128GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes (Gbps): 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 52

Additional context

Only node3 is having this issue; the rest of the nodes are working normally.

batulziiy (Author) commented

After raising the 'Concurrent Replica Rebuild Per Node Limit' setting from 5 to 8, Longhorn was able to rebuild the replicas that were stuck. But I still think the underlying issue exists, and I am planning to reboot the machine in the near future.
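The setting change described here can also be made from the CLI instead of the UI. A hedged sketch, assuming Longhorn is installed in the `longhorn-system` namespace; the setting resource name `concurrent-replica-rebuild-per-node-limit` corresponds to the UI option mentioned above:

```shell
# Inspect the current limit (Longhorn settings are CRD-backed resources)
kubectl -n longhorn-system get settings.longhorn.io \
  concurrent-replica-rebuild-per-node-limit -o jsonpath='{.value}'

# Raise the limit from 5 to 8, matching the change that unblocked the rebuilds here
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-replica-rebuild-per-node-limit --type merge -p '{"value":"8"}'

# Check for replicas still stuck on the affected node
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep node3
```

Note that raising the limit works around the symptom; the stale rebuild entries counted against the limit may still warrant investigation.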
