
Stopping phantom pods from accessing resources #3

rhperera opened this issue Jan 8, 2020 · 4 comments

rhperera commented Jan 8, 2020

Hi, thanks for the nice project.
I just wanted to know what happens when a pod becomes a phantom, and whether it is killed before another pod takes over. To break it down:

  • I have two pods, a primary and a secondary
  • The pods are on different machines
  • Only the primary writes to a file system and a DB
  • The primary loses its connection to Kube but still has access to the resources and keeps writing, thinking it is still the primary
  • The secondary pod takes over as the primary
  • It starts to write to the same resources as a primary
  • Now two pods are writing to the resources, which I want to avoid (one possible guard is sketched below)
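One generic way to make the phantom primary fail safe, independent of this project, is to guard every write behind a Kubernetes Lease check, so a pod that has lost the API server also refuses to write. A minimal sketch; the lease name app-leader, the default namespace, and the POD_NAME variable are assumptions for illustration only:

#!/bin/sh
# Hypothetical write guard: only write while we still hold the leader lease.
# If the API server is unreachable, the lookup fails and we refuse to write.
HOLDER=$(kubectl get lease app-leader -n default \
  -o jsonpath='{.spec.holderIdentity}') || exit 1
if [ "$HOLDER" = "$POD_NAME" ]; then
  echo "still the leader, safe to write"
else
  echo "not the leader any more, refusing to write" >&2
  exit 1
fi
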
kvaps (Owner) commented Jan 8, 2020

Hi, fencing-controller does not operate on failed pods; it watches only the nodes, e.g. by running:

kubectl get node -w -l fencing=enabled

If a node changes its state to NotReady, it will check the reason:

kubectl get node <nodename> -o 'custom-columns=STATUS:.status.conditions[?(@.type=="Ready")].reason'

If the reason is NodeStatusUnknown, it will initiate the fencing procedure by calling:

kubectl exec -n fencing fencing-agents-577dff5bf8-fp5np /scripts/fence.sh <nodename>

Here your fence.sh script should guarantee that the node was successfully killed and exit with exit 0; only after that will fencing-controller erase the node and the pods on it.
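For illustration, a minimal fence.sh sketch assuming the node can be powered off over IPMI; the BMC hostname convention and the IPMI_USER/IPMI_PASS variables are assumptions, not part of this project:

#!/bin/sh
# Hypothetical fence.sh: power the node off via its BMC and exit 0 only
# once the chassis is confirmed off, as fencing-controller requires.
NODE="$1"
BMC_HOST="${NODE}-ipmi"   # assumption: BMC reachable under this name
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" \
  chassis power off || exit 1
# Verify the power state before reporting success.
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" \
  chassis power status | grep -q off || exit 1
exit 0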

rhperera (Author) commented:

So if the node loses its connection to the kube API server and cannot recover from it, can the fencing agent shut down the node by itself?

aptdamia commented:

In my understanding, the fencing-controller monitors the node status for "NodeStatusUnknown".
Once this happens, it waits for a grace period to ensure the condition wasn't temporary, then it triggers the fencing step.
This selects an available agent and runs a fencing script. The script is configurable by you via the deployment templates, but its goal is to run commands that shut down the server.
In some cases the server may not be responsive over your standard network, so it is good to include a STONITH approach such as shutting it down via IPMI/iLO/etc.
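A rough sketch of the wait-then-fence flow described above; the 300-second grace period is an invented value here, and the real controller implements this logic internally:

#!/bin/sh
# Hypothetical re-check: fence only if the node is still NodeStatusUnknown
# after a grace period.
NODE="$1"; GRACE=300
reason() {
  kubectl get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'
}
if [ "$(reason)" = "NodeStatusUnknown" ]; then
  sleep "$GRACE"   # give the node a chance to recover
  [ "$(reason)" = "NodeStatusUnknown" ] && /scripts/fence.sh "$NODE"
fi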

pacoxu commented Sep 10, 2020

IPMI info may allow a quicker response for powering the node off.
For deciding to drain a node, hardware signals would be the strongest clue; the network being down for a period would be a good basis as well.
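For completeness, once fencing has confirmed the node is powered off, evicting its pods and removing it from the cluster is a standard kubectl sequence, shown here only as illustration:

# Evict the fenced node's workloads, then remove the node object.
# (Flags for handling local data vary by kubectl version.)
kubectl drain <nodename> --ignore-daemonsets --force
kubectl delete node <nodename>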
