[BUG] v1.2.3 longhorn.manager readiness probe fails with connection refused #8597

Open
IJOL opened this issue May 19, 2024 · 1 comment
Labels
kind/bug, require/backport, require/qa-review-coverage

Comments

IJOL commented May 19, 2024

After 2 years of rock-solid stability, longhorn-manager has suddenly started failing its readiness probes with a connection refused error on port 9500, and volumes fail to start. We need help solving this. The pods appear to be running (at least their status says so), and the logs don't contain anything helpful. Our system has 3 manager/storage nodes and about 73 volumes, which is not that many. Below is the log from a manager pod:
2024/05/19 07:36:35 proto: duplicate proto type registered: VersionResponse
time="2024-05-19T07:36:35Z" level=info msg="Start overwriting built-in settings with customized values"
W0519 07:36:35.528545 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-05-19T07:36:35Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access '/var/lib/rancher/longhorn/engine-binaries/*': No such file or directory\n, error exit status 2"
I0519 07:36:35.537768 1 leaderelection.go:242] attempting to acquire leader lease longhorn-system/longhorn-manager-upgrade-lock...
time="2024-05-19T07:36:35Z" level=info msg="New upgrade leader elected: chi.xxxx.us"

The 3 nodes have logs identical to this one, differing only in the upgrade leader and similar details. Sometimes it does a cron job cleanup or the like, but nothing very helpful.

Any help will be greatly appreciated. The machines were thoroughly tested for hardware glitches all night, but nothing was found to explain what we have seen. Before the current state, where none of the managers passes readiness, we saw managers appearing and disappearing from the web UI. We did a complete restart and got the whole system working again, with everything green, but then the same behaviour returned: nodes appearing and disappearing from the UI. We decided to test the hardware, but this morning the managers fail to start entirely.
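To narrow down the "connection refused" on port 9500 described above, it can help to tell apart a port with nothing listening (refused) from one that is unreachable (timeout). The following is a minimal sketch, not part of Longhorn: the function name `check_tcp_port` and the example address are hypothetical, and it assumes the probe target can be tested with a plain TCP connection from a machine that can reach the pod IP.

```python
import errno
import socket


def check_tcp_port(host: str, port: int = 9500, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt as open, refused, timeout, or error."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (filtered or unreachable).
        return "timeout"
    except OSError as exc:
        # Any other socket-level failure, reported by its errno name if known.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"


if __name__ == "__main__":
    # Hypothetical pod IP; replace with the address of a longhorn-manager pod.
    print(check_tcp_port("10.42.0.15", 9500))
```

A result of "refused" while the pod reports Running would suggest the process inside the container has not yet bound the port, which points at startup rather than at the network.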

IJOL added the kind/bug, require/backport, and require/qa-review-coverage labels on May 19, 2024
shuo-wu (Contributor) commented May 28, 2024

What's your current Longhorn version? Have you upgraded the Longhorn system recently?

The error log shows that longhorn-manager tried to check the engine binaries under the host directory /var/lib/rancher/longhorn/engine-binaries and found nothing. But Longhorn changed to use the directory /longhorn/engine-binaries a long time ago... Would you check these 2 directories manually?
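The manual check of the two directories could be sketched as follows. This is only an illustration: the helper name `summarize_dir` is hypothetical, the two paths are copied verbatim from the log and the comment above, and the script is assumed to run directly on each storage node's host (not inside the manager container) with enough privileges to read both paths.

```python
from pathlib import Path

# Paths taken verbatim from the thread; nothing else is assumed about the layout.
CANDIDATE_DIRS = [
    "/var/lib/rancher/longhorn/engine-binaries",  # path from the error log
    "/longhorn/engine-binaries",                  # path mentioned in the reply
]


def summarize_dir(path: str) -> str:
    """Report whether a directory is missing, empty, or list its entries."""
    p = Path(path)
    if not p.is_dir():
        return f"{path}: missing"
    entries = sorted(child.name for child in p.iterdir())
    if not entries:
        return f"{path}: empty"
    return f"{path}: {', '.join(entries)}"


if __name__ == "__main__":
    for candidate in CANDIDATE_DIRS:
        print(summarize_dir(candidate))
```

Running it on each of the 3 nodes and comparing the output would show whether the engine binaries exist in either location.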

Projects
Status: Pending user response