
Stop fencing actions when node gets healthy #159

Open
slintes opened this issue Nov 3, 2023 · 3 comments

Comments

@slintes (Member) commented Nov 3, 2023

There is a good chance that a reboot solves the issues on the node, and the node becomes healthy again. NHC will delete the SNR CR in that case.

However, when SNR assumes the node has rebooted (after waiting a fixed amount of time), it still continues fencing by deleting resources or adding the out-of-service taint. This isn't a big issue, because no workloads should be running after the reboot anyway (because of the "normal" NoExecute taint).

Still, it probably makes sense to skip this step: on a healthy node there is no need anymore to delete the remaining pods which tolerate the NoExecute taint. We can probably switch directly to the "FencingCompleted" code branch, which does the usual cleanup, like removing that NoExecute taint.

@k-keiichi-rh @mshitrit

This was triggered by the discussion here: medik8s/fence-agents-remediation#92 (comment)

@k-keiichi-rh (Contributor) commented

> > As for stopping any further fencing action on the healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane.
> > ...
>
> Not sure if I understand, what do you mean with "the control-plane should handle the fencing action"?

I just meant that we don't need to fence by deleting resources or adding the out-of-service taint if the node is healthy.
If the node can communicate with the control-plane and can report its status to it, the stateful workloads that would otherwise be stuck will fail over to another node without any action by SNR. So we can expect the control-plane, or the kubelet on the failed node, to be responsible for handling the stateful workloads.

@k-keiichi-rh (Contributor) commented

> I think we need to do some "cleanup" in SNR, e.g. removing taints which were already set in the pre-reboot phase. Maybe we can just switch to the fencing completed phase directly, it should do everything we need for cleanup?

I think so too. We can just switch to the fencing completed phase if the SNR CR has a deletion timestamp.

@mshitrit (Member) commented Nov 5, 2023

I think I'm leaning towards staying with the original approach.
At this stage we already know that there is no workload running on the node, so I think it makes sense to follow through with deleting the remaining resources from the API server (even if the node is healthy).
I'm having a hard time seeing the value of this change compared to the risk of introducing new code.
