
Stopping phantom pods from accessing resources #3

rhperera opened this issue Jan 8, 2020 · 4 comments

rhperera commented Jan 8, 2020

Hi, thanks for the nice project.
I just wanted to know what happens when a pod becomes a phantom, and whether it is killed before another pod takes over. To break it down:

  • I have two pods, a primary and a secondary
  • The pods are on different machines
  • Only the primary writes to a file system and a DB
  • The primary loses its connection to Kube but still has access to the resources and keeps writing, thinking it is still the primary
  • The secondary pod takes over as the primary
  • It starts to write to the same resources as a primary
  • Now two pods are writing to the resources, which I want to avoid (one possible guard is sketched below)
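One generic way to make the phantom primary fail safe, independent of this project, is to guard every write behind a Kubernetes Lease check, so a pod that has lost the API server also refuses to write. A minimal sketch; the lease name app-leader, the default namespace, and the POD_NAME variable are assumptions for illustration only:

#!/bin/sh
# Hypothetical write guard: only write while we still hold the leader lease.
# If the API server is unreachable, the lookup fails and we refuse to write.
HOLDER=$(kubectl get lease app-leader -n default \
  -o jsonpath='{.spec.holderIdentity}') || exit 1
if [ "$HOLDER" = "$POD_NAME" ]; then
  echo "still the leader, safe to write"
else
  echo "not the leader any more, refusing to write" >&2
  exit 1
fi
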
kvaps (Owner) commented Jan 8, 2020

Hi, fencing-controller does not operate on failed pods; it watches only the nodes, e.g. by running:

kubectl get node -w -l fencing=enabled

If a node changes its state to NotReady, it will check the reason:

kubectl get node <nodename> -o 'custom-columns=STATUS:.status.conditions[?(@.type=="Ready")].reason'

If the reason is NodeStatusUnknown, it will initiate the fencing procedure by calling:

kubectl exec -n fencing fencing-agents-577dff5bf8-fp5np /scripts/fence.sh <nodename>

Here your fence.sh script should guarantee that the node was successfully killed and exit with exit 0; only after that will fencing-controller erase the node and the pods on it.
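For illustration, a minimal fence.sh sketch assuming the node can be powered off over IPMI; the BMC hostname convention and the IPMI_USER/IPMI_PASS variables are assumptions, not part of this project:

#!/bin/sh
# Hypothetical fence.sh: power the node off via its BMC and exit 0 only
# once the chassis is confirmed off, as fencing-controller requires.
NODE="$1"
BMC_HOST="${NODE}-ipmi"   # assumption: BMC reachable under this name
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" \
  chassis power off || exit 1
# Verify the power state before reporting success.
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" \
  chassis power status | grep -q off || exit 1
exit 0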

rhperera (Author) commented:

So if the node loses its connection to the kube API server and cannot recover from it, can the fencing agent shut down the node by itself?

aptdamia commented:

In my understanding, the fencing-controller monitors the node status for "NodeStatusUnknown".
Once this happens, it waits for a grace period to ensure the condition wasn't temporary, then it triggers the fencing step.
This selects an available agent and runs a fencing script. The script is configurable by you via the deployment templates, but its goal is to run commands that shut down the server.
In some cases the server may not be responsive over your standard network, so it is good to include a STONITH approach such as shutting it down via IPMI/iLO/etc.
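A rough sketch of the wait-then-fence flow described above; the 300-second grace period is an invented value here, and the real controller implements this logic internally:

#!/bin/sh
# Hypothetical re-check: fence only if the node is still NodeStatusUnknown
# after a grace period.
NODE="$1"; GRACE=300
reason() {
  kubectl get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'
}
if [ "$(reason)" = "NodeStatusUnknown" ]; then
  sleep "$GRACE"   # give the node a chance to recover
  [ "$(reason)" = "NodeStatusUnknown" ] && /scripts/fence.sh "$NODE"
fi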

pacoxu commented Sep 10, 2020

IPMI info may allow a quicker response for powering the node off.
For deciding to drain a node, hardware signals would be the strongest clue; the network being down for a period would be a good basis as well.
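For completeness, once fencing has confirmed the node is powered off, evicting its pods and removing it from the cluster is a standard kubectl sequence, shown here only as illustration:

# Evict the fenced node's workloads, then remove the node object.
# (Flags for handling local data vary by kubectl version.)
kubectl drain <nodename> --ignore-daemonsets --force
kubectl delete node <nodename>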
