[BUG] v1.2.3 longhorn-manager readiness probe fails with connection refused #8597
Labels
kind/bug
require/backport
Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage
Require QA to review coverage
After 2 years of rock-solid stability, longhorn-manager has suddenly started failing its readiness probe with connection refused on port 9500, and volumes fail to start. I need help figuring this out. The pods are apparently running, at least their status says so, and the logs contain nothing helpful. Our system has 3 manager/storage nodes and about 73 volumes, which is not that much, I should say. Below is the log from a manager pod:
```
2024/05/19 07:36:35 proto: duplicate proto type registered: VersionResponse
time="2024-05-19T07:36:35Z" level=info msg="Start overwriting built-in settings with customized values"
W0519 07:36:35.528545 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-05-19T07:36:35Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0519 07:36:35.537768 1 leaderelection.go:242] attempting to acquire leader lease longhorn-system/longhorn-manager-upgrade-lock...
time="2024-05-19T07:36:35Z" level=info msg="New upgrade leader elected: chi.xxxx.us"
```
The 3 nodes have logs identical to this one, apart from which node is elected upgrade leader and such; sometimes one does a cron-job cleanup or the like, but nothing very helpful.
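In case it helps with triage, this is roughly how we have been poking the failing endpoint by hand (a sketch; the pod name is a placeholder, and it assumes the stock longhorn-system namespace with the manager API listening on 9500, which is the port the probe hits):

```sh
# List the manager pods (Longhorn labels them app=longhorn-manager)
kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide

# Inspect the readiness probe definition and any recent probe-failure events
kubectl -n longhorn-system describe pod longhorn-manager-xxxxx

# Forward the API port and hit it from the workstation;
# longhorn-manager-xxxxx is a placeholder for one of the pod names above
kubectl -n longhorn-system port-forward pod/longhorn-manager-xxxxx 9500:9500 &
curl -v http://127.0.0.1:9500/v1
```

If the port-forwarded curl is also refused, the process is not listening at all, which would point away from a probe or CNI misconfiguration and toward the manager itself never coming up.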
Any help will be greatly appreciated. The machines were thoroughly tested for hardware glitches all night, but nothing was found that would explain what we have seen. Before reaching the current state, where none of the managers passes readiness, we saw managers appearing and disappearing from the web UI. We did a complete restart and got the whole system working again, everything green, but the same behaviour started over, with nodes appearing and disappearing from the UI. We decided to test the hardware, but this morning the managers fail to start entirely.
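For what it's worth, the flapping can also be watched outside the UI through the Longhorn node CRs, which should show the same ready/not-ready transitions (a minimal sketch, assuming the default longhorn-system namespace; `<node-name>` is a placeholder):

```sh
# Watch the Longhorn node CRs; their Ready condition should flap in sync
# with the nodes appearing and disappearing from the web UI
kubectl -n longhorn-system get nodes.longhorn.io -w

# Dump the full status and conditions of one node for details
kubectl -n longhorn-system get nodes.longhorn.io <node-name> -o yaml
```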