[BUG] New replicas are not being created and stuck at 'stopped' state on one of the longhorn nodes #8535
Labels
kind/bug
require/backport
Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage
Require QA to review coverage
Describe the bug
I found that new replicas are not being created on one of the Longhorn nodes; when a replica is scheduled there, it gets stuck in the 'stopped' state. About 1.5 months ago, some of the Longhorn nodes were restarted, which could be the reason for this behavior. The rest of the nodes work normally, and replicas are distributed across them without any issues. I checked the replica, instance, and manager logs and found the following message:
time="2024-05-09T12:23:45Z" level=debug msg="Replica rebuildings for map[pvc1:{} pvc2:{} pvc3:{} pvc4:{} pvc5:{}] are in progress on this node, which reaches or exceeds the concurrent limit value 5" controller=longhorn-replica dataPath= node=node3 nodeID=node3 ownerID=node3 replica=replica-id
As the message states, the number of concurrent rebuilds has reached the limit of 5; however, only 1 replica is rebuilding at the moment. Moreover, the replicas (pvc1-5) mentioned above do not exist when I check in the Longhorn UI, but they do exist physically on the disk. Could this be the root cause? Should I delete them? I appreciate your help.
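To make the mismatch concrete, here is a minimal sketch that pulls the counted volume names and the limit out of the quoted log message and compares them with the volumes that are visibly rebuilding. It only parses the message text shown above; it does not call any Longhorn API. The function names and the `visible` list are hypothetical and are supplied by hand (for example, from what the UI shows).

```python
import re

# The log line quoted in this issue, copied verbatim.
LOG_LINE = (
    'time="2024-05-09T12:23:45Z" level=debug msg="Replica rebuildings for '
    'map[pvc1:{} pvc2:{} pvc3:{} pvc4:{} pvc5:{}] are in progress on this node, '
    'which reaches or exceeds the concurrent limit value 5" '
    'controller=longhorn-replica dataPath= node=node3 nodeID=node3 '
    'ownerID=node3 replica=replica-id'
)

MAP_RE = re.compile(r"Replica rebuildings for map\[(?P<entries>[^\]]*)\]")
LIMIT_RE = re.compile(r"concurrent limit value (?P<limit>\d+)")


def parse_rebuild_message(line):
    """Return (names counted as rebuilding, limit), or None if the line does not match."""
    map_match = MAP_RE.search(line)
    limit_match = LIMIT_RE.search(line)
    if not (map_match and limit_match):
        return None
    # The map is printed as "key:{} key:{} ..."; keep only the keys.
    names = [entry.split(":", 1)[0] for entry in map_match.group("entries").split()]
    return names, int(limit_match.group("limit"))


def stale_entries(counted, visible):
    """Names the log counts as rebuilding that are not in the hand-supplied visible list."""
    return sorted(set(counted) - set(visible))


if __name__ == "__main__":
    counted, limit = parse_rebuild_message(LOG_LINE)
    # Hypothetical input: names of volumes actually seen rebuilding on the node.
    visible = []
    print(f"counted={len(counted)} limit={limit}")
    print("not visibly rebuilding:", stale_entries(counted, visible))
```

If every counted name turns up as not visibly rebuilding while the count still equals the limit, that would be consistent with leftover entries blocking new rebuilds on the node, though this check alone does not confirm the root cause.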
To Reproduce
It's a bit difficult to reproduce the situation, I guess.
Expected behavior
New replicas of a volume should be created on the node without getting stuck.
Support bundle for troubleshooting
I will send the support bundle to longhorn-support-bundle@suse.com with the issue ID in the subject field.
Environment
Additional context
Only node3 has this issue; the rest of the nodes work normally.