[BUG] v1.2.3 longhorn.manager readiness probe fails with connection refused #8597

IJOL · 2024-05-19T07:48:29Z

After two years of rock-solid stability, longhorn-manager has suddenly started failing its readiness probes with "connection refused" on port 9500, and volumes fail to start. The pods report that they are running, and the logs don't show anything helpful. Our system has 3 manager/storage nodes and about 73 volumes, which is not that many. I need help solving this. Below is the log from a manager pod:
2024/05/19 07:36:35 proto: duplicate proto type registered: VersionResponse
time="2024-05-19T07:36:35Z" level=info msg="Start overwriting built-in settings with customized values"
W0519 07:36:35.528545 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-05-19T07:36:35Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0519 07:36:35.537768 1 leaderelection.go:242] attempting to acquire leader lease longhorn-system/longhorn-manager-upgrade-lock...
time="2024-05-19T07:36:35Z" level=info msg="New upgrade leader elected: chi.xxxx.us"

The 3 nodes have logs identical to this one, varying only in the upgrade leader and such. Sometimes a manager does a cron job cleanup or similar, but nothing very helpful.

Any help will be greatly appreciated. The machines were thoroughly tested for hardware glitches all night, but nothing was found to explain what we have seen. Before reaching the current state, where none of the managers passes readiness, we saw managers appearing and disappearing from the web UI. We did a complete restart and got the whole system working again, everything green, but then the same behaviour returned, with nodes appearing and disappearing from the UI. We decided to test the hardware, but this morning the managers fail to start entirely.
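To separate "nothing is listening on 9500" from a network or firewall problem, a plain TCP connect against the manager pod IP can help. The following is only a minimal sketch: the function name `probe_tcp` and the result strings are made up for illustration, and it assumes the probe is a simple TCP connection to port 9500, which should be confirmed against the longhorn-manager DaemonSet spec.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: more typical of a dropped packet or firewall rule.
        return "timeout"
    except OSError as exc:
        # Anything else (DNS failure, unreachable network, ...).
        return f"error: {exc}"


# Example (hypothetical pod IP): print(probe_tcp("10.42.0.15", 9500))
```

A result of "refused" from another node would suggest the manager process inside the pod never started listening, rather than a connectivity issue between nodes.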


shuo-wu · 2024-05-28T18:36:22Z

What's your current Longhorn version? Have you upgraded the Longhorn system recently?

The error log shows that longhorn-manager tried to check the engine binaries under the host directory /var/lib/rancher/longhorn/engine-binaries and found nothing. But Longhorn changed to using the directory /longhorn/engine-binaries a long time ago... Would you check these 2 directories manually?
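One way to do that check on each node is sketched below. It is an illustration rather than an official Longhorn tool: the helper name `inspect_dirs` is hypothetical, the two paths are copied from this thread, and it has to run on the host itself (not inside the manager pod), since the manager inspects the host mount namespace via nsenter.

```python
from pathlib import Path

# Paths taken from the log line and the comment above.
ENGINE_BINARY_DIRS = [
    "/var/lib/rancher/longhorn/engine-binaries",
    "/longhorn/engine-binaries",
]


def inspect_dirs(paths):
    """Map each path to a sorted list of its entries, or None if it is not a directory."""
    report = {}
    for p in paths:
        d = Path(p)
        if d.is_dir():
            report[p] = sorted(child.name for child in d.iterdir())
        else:
            report[p] = None
    return report


if __name__ == "__main__":
    for path, entries in inspect_dirs(ENGINE_BINARY_DIRS).items():
        if entries is None:
            print(f"{path}: missing")
        else:
            print(f"{path}: {len(entries)} entries {entries}")
```

Posting the output from all 3 nodes would show whether either directory exists and what it contains.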

IJOL added the kind/bug, require/backport, and require/qa-review-coverage labels on May 19, 2024
