
pod-network-loss: cleanup fails because target pod has been restarted #591

Open · ganto opened this issue Oct 20, 2022 · 5 comments · May be fixed by #702
Labels: enhancement (New feature or request)

ganto commented Oct 20, 2022

BUG REPORT

What happened:
I ran the chaos experiment 'pod-network-loss' against a pod, and the helper successfully injected the Linux traffic control rule to block the traffic. This caused the pod to fail its network-based liveness probe, so Kubernetes killed and restarted it. Once the experiment ended, the helper attempted to revert the traffic control rule, but this failed because the original process in the pod was no longer running. As a result, the helper pod failed and the application pod was stuck in CrashLoopBackOff, because the network-based readiness probe still could not succeed while the traffic remained blocked.

time="2022-10-20T14:20:33Z" level=info msg="Helper Name: network-chaos"
time="2022-10-20T14:20:33Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-10-20T14:20:33Z" level=info msg="container ID of tls-terminator container, containerID: 3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44"
time="2022-10-20T14:20:33Z" level=info msg="Container ID: 3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44"
time="2022-10-20T14:20:33Z" level=info msg="[Info]: Container ID=3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44 has process PID=360376"
time="2022-10-20T14:20:33Z" level=info msg="/bin/bash -c sudo nsenter -t 360376 -n tc qdisc replace dev eth0 root netem loss 100"
time="2022-10-20T14:20:34Z" level=info msg="[Chaos]: Waiting for 300s"
time="2022-10-20T14:25:34Z" level=info msg="[Chaos]: Stopping the experiment"
time="2022-10-20T14:25:34Z" level=info msg="/bin/bash -c sudo nsenter -t 360376 -n tc qdisc delete dev eth0 root"
time="2022-10-20T14:25:34Z" level=error msg="nsenter: can't open '/proc/360376/ns/net': No such file or directory\n"
time="2022-10-20T14:25:34Z" level=fatal msg="helper pod failed, err: exit status 1"

What you expected to happen:
Once the experiment completes, the traffic control rule is removed so that the application pod can function properly again.

How to reproduce it (as minimally and precisely as possible):

  • Setup the experiment with all the necessary Kubernetes resources
  • Create a deployment with a network-based liveness probe. E.g.:
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
  • Run the ChaosEngine with a long enough TOTAL_CHAOS_DURATION so that the liveness probe reaches the failure threshold and the pod is killed

Anything else we need to know?:
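For reference, a rough sketch of the commands the helper effectively runs (the nsenter/tc lines are taken from the logs above; resolving the PID from the container ID via crictl and jq is only illustrative and depends on the container runtime):

# resolve the target container's main process PID (illustrative; runtime-dependent)
CONTAINER_ID=<target-container-id>
PID=$(crictl inspect "$CONTAINER_ID" | jq -r '.info.pid')

# inject: drop 100% of the outgoing traffic inside the target's network namespace
sudo nsenter -t "$PID" -n tc qdisc replace dev eth0 root netem loss 100

# ... TOTAL_CHAOS_DURATION elapses; meanwhile the liveness probe fails,
#     Kubernetes restarts the container and /proc/$PID/ns/net disappears ...

# revert: fails with "can't open '/proc/<pid>/ns/net'" once the process is gone
sudo nsenter -t "$PID" -n tc qdisc delete dev eth0 root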


ganto commented Oct 21, 2022

I investigated the situation a bit further and think that other experiments, such as pod-network-latency, are affected by the same problem:

time="2022-10-21T09:10:28Z" level=info msg="Helper Name: network-chaos"
time="2022-10-21T09:10:28Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-10-21T09:10:28Z" level=info msg="container ID of tls-terminator container, containerID: a469af7c9821084eee6cb44d740a56f64ceac4e378dbeb5c842ad0b5c0cf1b31"
time="2022-10-21T09:10:28Z" level=info msg="Container ID: a469af7c9821084eee6cb44d740a56f64ceac4e378dbeb5c842ad0b5c0cf1b31"
time="2022-10-21T09:10:28Z" level=info msg="[Info]: Container ID=a469af7c9821084eee6cb44d740a56f64ceac4e378dbeb5c842ad0b5c0cf1b31 has process PID=5028"
time="2022-10-21T09:10:28Z" level=info msg="/bin/bash -c sudo nsenter -t 5028 -n tc qdisc replace dev eth0 root netem delay 2000ms 0ms"
time="2022-10-21T09:10:29Z" level=info msg="[Chaos]: Waiting for 300s"
time="2022-10-21T09:15:29Z" level=info msg="[Chaos]: Stopping the experiment"
time="2022-10-21T09:15:29Z" level=info msg="/bin/bash -c sudo nsenter -t 5028 -n tc qdisc delete dev eth0 root"
time="2022-10-21T09:15:29Z" level=error msg="nsenter: can't open '/proc/5028/ns/net': No such file or directory\n"
time="2022-10-21T09:15:29Z" level=fatal msg="helper pod failed, err: exit status 1"

A possible solution would be to replace the nsenter -t <pid> -n commands with ip netns. Something like:

$ netid=$(ip netns identify <pid>)
$ ip netns exec $netid tc qdisc ...

With this approach a container restart doesn't affect the cleanup. Additionally, before the cleanup there could be a check to verify that the network namespace still exists:

$ ip netns list | grep $netid

If it doesn't, the pod has been recreated in the meantime and no cleanup is necessary anymore.
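
Putting the two together, the revert step could look roughly like this (just a sketch of the idea; as the next comment shows, resolving and opening the namespace this way is not straightforward from inside a container):

netid=$(ip netns identify "$PID")

# revert only if the namespace still exists; if it is gone, the pod was
# recreated and there is nothing left to clean up
if ip netns list | grep -q "$netid"; then
  ip netns exec "$netid" tc qdisc delete dev eth0 root
fi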


ganto commented Oct 24, 2022

I did some prototyping and found that it is a bit more complicated than I originally thought. ip netns identify only seems to return a useful network namespace name when executed directly on the host:

sh-4.4# ip netns identify 121688
4d2661e2-c926-40cf-83e6-20ff5dd1aecb

When running in a privileged container, however, it cannot resolve the "name" of the namespace:

~ $ sudo ip netns identify 121688

~ $

While it would be possible to retrieve the value via the container runtime, similar to the container PID...

sh-4.4# crictl inspect 994f41ddf38cb | jq '.info.runtimeSpec.linux.namespaces[] | select(.type == "network")'
{
  "type": "network",
  "path": "/var/run/netns/4d2661e2-c926-40cf-83e6-20ff5dd1aecb"
}

obviously the ip netns exec commands would still fail because the namespace is not known within the container:

~ $ sudo ip netns exec 4d2661e2-c926-40cf-83e6-20ff5dd1aecb tc qdisc replace dev eth0 root netem loss 100
Cannot open network namespace "4d2661e2-c926-40cf-83e6-20ff5dd1aecb": No such file or directory
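
One possible direction (purely a sketch, not something the current helper supports): mount the host's /var/run/netns into the helper pod and enter the namespace by its file path instead of by PID, so that a container restart does not invalidate the reference as long as the pod sandbox and its network namespace survive:

# path taken from the crictl output above; visible inside the helper only if
# the host's /var/run/netns is mounted into it (e.g. via a hostPath volume)
NETNS_PATH=$(crictl inspect "$CONTAINER_ID" \
  | jq -r '.info.runtimeSpec.linux.namespaces[] | select(.type == "network") | .path')

sudo nsenter --net="$NETNS_PATH" tc qdisc delete dev eth0 root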

neelanjan00 added the enhancement (New feature or request) label on Apr 4, 2024

neelanjan00 (Member) commented:

Thanks for raising this issue! We will add this feature in a subsequent release. Thanks again for being so patient!

Calvinaud (Contributor) commented:

Hello,
We are also encountering this issue.

The root cause seems to be the same for us: the pod gets restarted due to a liveness probe, and the helper doesn't revert the chaos because the container ID / process changed, so the chaos never gets reverted.

Since the container/process changed, would it be a viable solution to re-fetch the container ID and the process as part of the revert, before the actual clean-up?
@ispeakc0de @ganto What is your opinion on this?
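
A rough sketch of that re-fetch idea (the container name is taken from the logs above; the crictl flags and jq fields are assumptions about the runtime, not the current helper code):

# look up the *current* container ID and PID right before the revert
NEW_CONTAINER_ID=$(crictl ps -q --name tls-terminator)
NEW_PID=$(crictl inspect "$NEW_CONTAINER_ID" | jq -r '.info.pid')

sudo nsenter -t "$NEW_PID" -n tc qdisc delete dev eth0 root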


Calvinaud commented May 22, 2024

I think the solution I proposed still has a gap, since the pod could be restarted again between the time we fetch the container ID/process and the time we do the actual clean-up.
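
A retry around the revert would narrow that window, though it cannot fully close it (again only a sketch, using the same illustrative crictl/jq lookups as above):

# re-resolve the PID and retry the revert a few times if the target vanishes
for attempt in 1 2 3; do
  PID=$(crictl inspect "$(crictl ps -q --name tls-terminator)" | jq -r '.info.pid')
  if sudo nsenter -t "$PID" -n tc qdisc delete dev eth0 root; then
    break   # revert succeeded
  fi
  sleep 1   # the process may have vanished between lookup and revert; try again
done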
