
Stop fencing actions when node gets healthy #159

Open
slintes opened this issue Nov 3, 2023 · 3 comments

Comments

@slintes (Member) commented Nov 3, 2023

There is a good chance that a reboot solves the issues on the node, and the node becomes healthy again. NHC will delete the SNR CR in that case.

However, when SNR assumes the node has rebooted (after waiting a fixed amount of time), it still continues fencing by deleting resources or adding the out-of-service taint. This isn't a big issue, because no workloads should be running after the reboot anyway (because of the "normal" NoExecute taint).

Still, it probably makes sense to skip this step: on a healthy node there is no need anymore to delete the remaining pods which tolerate the NoExecute taint. We can probably switch directly to the "FencingCompleted" code branch, which does the usual cleanup, like removing that NoExecute taint.

@k-keiichi-rh @mshitrit

This was triggered by the discussion here: medik8s/fence-agents-remediation#92 (comment)

@k-keiichi-rh (Contributor) commented

> > As for stopping any further fencing action on the healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane.
> > ...
>
> Not sure if I understand, what do you mean with "the control-plane should handle the fencing action"?

I just meant that we don't need to fence by deleting resources or adding the out-of-service taint if the node is healthy.
If the node can communicate with the control-plane and can report its status to it, the stateful workloads that would otherwise be stuck will fail over to another node without any action by SNR. So we can expect the control-plane, or the kubelet on the failed node, to be responsible for handling the stateful workloads.

@k-keiichi-rh (Contributor) commented

> I think we need to do some "cleanup" in SNR, e.g. removing taints which were already set in the pre-reboot phase. Maybe we can just switch to the fencing completed phase directly, it should do everything we need for cleanup?

I think so too. We can just switch to the fencing completed phase if the SNR CR has a deletion timestamp.

@mshitrit (Member) commented Nov 5, 2023

I think I'm leaning towards staying with the original approach.
At this stage we already know that there is no workload running on the node, so I think it makes sense to follow through with deleting the remaining resources from the API server (even if the node is healthy).
I'm having a hard time seeing the value of this change compared to the risk of introducing new code.
