
cmdDel fails releasing the device when kubelet deletes pause container #126

blackgold opened this issue May 24, 2020 · 14 comments

@blackgold

blackgold commented May 24, 2020

What happened?

Kubelet doesn't guarantee that the pause container stays alive while the CNI deletes all the devices attached to the pod. Once the pause container is deleted, the netns is no longer available for cmdDel to release the device from. This leaves the device on the host with the wrong name, a missing IP, and wrong settings.

What did you expect to happen?

Kubelet should provide some guarantee that the netns remains available while the CNI deletes all attached devices.

What are the minimal steps needed to reproduce the bug?

Attach at least 4 SR-IOV devices to a pod, then kill the pod.
To reproduce the error consistently, add a 1-second sleep in cmdDel (a sketch follows below).
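
A minimal sketch of that reproduction aid: delay cmdDel by one second so the pause container (and its netns) is usually gone before the VF release runs. The skeleton below is generic libcni plugin boilerplate, not the actual sriov-cni source; only the added time.Sleep is the point.

```go
package main

import (
	"time"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/version"
)

func cmdAdd(args *skel.CmdArgs) error   { return nil }
func cmdCheck(args *skel.CmdArgs) error { return nil }

func cmdDel(args *skel.CmdArgs) error {
	// Widen the race window: by the time we try to move the VF back to the
	// host, kubelet has typically already deleted the pause container.
	time.Sleep(1 * time.Second)
	// ... the real sriov-cni release logic would run here ...
	return nil
}

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "sriov (delay sketch)")
}
```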

Anything else we need to know?

We raised the issue with Kubernetes but could not get any positive response:
kubernetes/kubernetes#89440
As a workaround, we run a daemon that periodically tries to fix the broken devices on the host.

Component Versions

Please fill in the table below with the version numbers of the applicable components used.

Component | Version
--- | ---
SR-IOV CNI Plugin | v2.2
Multus | v3.4
SR-IOV Network Device Plugin | v2.2
Kubernetes | 1.13.5
OS | Ubuntu 18

Config Files

Config file locations may be config dependent.

CNI config (Try '/etc/cni/net.d/')
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)

Added some custom logs to print cmdArgs and netns
time="2020-04-24T17:22:52Z" level=info msg="read from cache &{NetConf:{CNIVersion:0.3.1 Name:sriov-network Type:sriov Capabilities:map[] IPAM:{Type:} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[dns:map[] interfaces:[map[name:net1 sandbox:/proc/4281/ns/net]]] PrevResult:} DPDKMode:false Master:enp5s0 MAC: AdminMAC: EffectiveMAC: Vlan:0 VlanQoS:0 DeviceID:0000:05:00.1 VFID:0 HostIFNames:net1 ContIFNames:net1 MinTxRate: MaxTxRate: SpoofChk: Trust: LinkState: Delegates:[{CNIVersion:0.3.1 Name:sbr Type:sbr Capabilities:map[] IPAM:{Type:} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[] PrevResult:}] RuntimeConfig:{Mac:} IPNet:}"
time="2020-04-24T17:22:52Z" level=info msg="empty netns , error = failed to Statfs "/proc/4281/ns/net": no such file or directory"
time="2020-04-24T17:22:52Z" level=info msg="ReleaseVF "
time="2020-04-24T17:22:52Z" level=error msg="failed to get netlink device with name net1"

Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)

Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [error] Multus: error in invoke Delegate del - "sriov": error in removing device from net namespace: 1failed to get netlink device with name net3: Link not found
Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [debug] delegateDel: , net2, &{{0.3.1 sriov-network sriov map[] {} {[] [] []}} { []} false false [123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 101 108 101 103 97 116 101 115 34 58 91 123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 110 9 101 34 58 34 115 98 114 34 44 34 116 121 112 101 34 58 34 115 98 114 34 125 93 44 34 100 101 118 105 99 101 73 68 34 58 34 48 48 48 48 58 48 99 58 48 48 46 49 34 44 34 110 97 109 101 34 58 34 115 114 105 111 118 45 110 101 116 119 111 114 107 34 44 34 116 121 112 101 34 58 34 115 114 105 111 118 34 125]}, &{cfba15035e7ef328153ba5c88853b52f97740560bc27a0707ab2f5b536a8f863 /proc/32764/ns/net net2 [[IgnoreUnknown 1] [K8S_POD_NAMESPACE user] [K8S_POD_NAME 847138-worker-1] [K8S_POD_INFRA_CONTAINER_ID cfba15035e7ef328153ba5c88853b52f97740560bc27a0707ab2f5b536a8f863]] map[] }, /opt/cni/bin
Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [verbose] Del: user:847138-worker-1:sriov-network:net2 {"cniVersion":"0.3.1","delegates":[{"cniVersion":"0.3.1","name":"sbr","type":"sbr"}],"deviceID":"0000:0c:00.1","name":"sriov-network","type":"sriov"}
Mar 23 21:04:46 dgx0098 kubelet[29124]: I0323 21:04:46.544632 29124 plugins.go:391] Calling network plugin cni to tear down pod "847138-worker-1_user"

@blackgold
Author

blackgold commented May 24, 2020

[Screenshot: Screen Shot 2020-05-24 at 9.25.03 AM]

We currently use a sriov-cleaner daemon to clean up devices that end up in a bad state on the host. Looking for suggestions on an ideal solution.

@zshi-redhat
Collaborator

I think the long-term solution would be to wait for the kubelet fix.

For the workaround:
sriov-cni does two things in the container namespace when releasing a VF:

  1. rename the VF
  2. reset the effective MAC address

If sriov-cni can catch the failure in ReleaseVF when the netns no longer exists, it might be able to continue the release process in the init netns (assuming the VF is released to the default host netns upon failure). Is there any other information needed to recover from an early deletion of the pause container?
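
A hedged sketch of that fallback, assuming recent containernetworking/plugins and vishvananda/netlink APIs: if ns.GetNS fails because the pause container's netns is gone (the "failed to Statfs" error in the logs above), skip the in-namespace steps and try to finish the rename in the host namespace. releaseInHostNS is a hypothetical helper, not an existing sriov-cni function, and, as noted later in this thread, the VF may not reappear in the host netns until all cmdDel calls have finished.

```go
package main

import (
	"fmt"
	"os"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

// releaseVF tries the normal in-namespace cleanup and falls back to the host
// namespace when the pod netns no longer exists.
func releaseVF(netnsPath, contIFName, hostIFName string) error {
	podNS, err := ns.GetNS(netnsPath)
	if err != nil {
		// Pause container already deleted: the netns path is stale.
		fmt.Fprintf(os.Stderr, "netns %q is gone (%v), attempting cleanup in host netns\n", netnsPath, err)
		return releaseInHostNS(contIFName, hostIFName)
	}
	defer podNS.Close()

	return podNS.Do(func(_ ns.NetNS) error {
		link, err := netlink.LinkByName(contIFName)
		if err != nil {
			return fmt.Errorf("failed to get netlink device with name %s: %v", contIFName, err)
		}
		if err := netlink.LinkSetDown(link); err != nil {
			return err
		}
		// 1. rename the VF back to its host-side name
		// (2. resetting the effective MAC address would happen here as well)
		return netlink.LinkSetName(link, hostIFName)
	})
}

// releaseInHostNS is hypothetical: it renames the VF once it shows up in the
// host netns. Per the follow-up comments, the VF may only become visible after
// the last cmdDel completes, so this may need to poll or be deferred to an
// external cleaner.
func releaseInHostNS(contIFName, hostIFName string) error {
	link, err := netlink.LinkByName(contIFName)
	if err != nil {
		return fmt.Errorf("VF %s not (yet) visible in host netns: %v", contIFName, err)
	}
	return netlink.LinkSetName(link, hostIFName)
}

func main() {
	// Example values taken from the cached config in the logs above.
	if err := releaseVF("/proc/4281/ns/net", "net1", "net1"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```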

@blackgold
Author

I tried switching to the init namespace, but the device is not visible there. The device becomes visible on the host only after cmdDel has been called on all the devices.

@adrianchiris adrianchiris added the stale This issue did not have any activity nor a conclusion in the past 90 days label Nov 24, 2020
@JaseFace

We're running into this currently. We have 4-6 interfaces in use by the CNI, but are often finding 1 or 2 left with a bad interface name, and various settings that weren't reverted. The host usually has enough information to fix the handed back/abandoned interfaces after the failed/incomplete cmdDel. The struggle is we're then racing the cleanup against the pods spinning back up and requesting new interfaces. If they hit one of the abandoned interfaces before cleanup, things go south.

Also when something like the Mellanox E-Switch is involved, the host doesn't have enough information to safely nuke entries when MACs are being reused.

@blackgold
Author

blackgold commented May 28, 2021

> We're running into this currently. We have 4-6 interfaces in use by the CNI, but are often finding 1 or 2 left with a bad interface name, and various settings that weren't reverted. The host usually has enough information to fix the handed back/abandoned interfaces after the failed/incomplete cmdDel. The struggle is we're then racing the cleanup against the pods spinning back up and requesting new interfaces. If they hit one of the abandoned interfaces before cleanup, things go south.
>
> Also when something like the Mellanox E-Switch is involved, the host doesn't have enough information to safely nuke entries when MACs are being reused.
@JaseFace
I think it would be ideal to implement a CNI plugin that does two things, in the following order, in cmdAdd (roughly sketched below):

  1. Verify that the devices on the host are in the expected state; if not, fix them.
  2. Delegate to sriov-cni

@zshi-redhat
If sriov-cni can provide some ability to run prolog hooks, then we can invoke (1) using those hooks.
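
A rough sketch of that wrapper idea, assuming the standard libcni skel and invoke packages: cmdAdd first runs a hypothetical fixHostVF repair step keyed on the deviceID from the network config, then delegates to the real sriov plugin. This is not an existing plugin and fixHostVF is a placeholder; it only illustrates the ordering of (1) and (2).

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/containernetworking/cni/pkg/invoke"
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/version"
)

type conf struct {
	types.NetConf
	DeviceID string `json:"deviceID"`
}

// fixHostVF is hypothetical: it would rename the VF back, clear stale MAC/VLAN
// settings, etc., if a previous cmdDel was cut short.
func fixHostVF(deviceID string) error { return nil }

func cmdAdd(args *skel.CmdArgs) error {
	c := &conf{}
	if err := json.Unmarshal(args.StdinData, c); err != nil {
		return fmt.Errorf("failed to parse config: %v", err)
	}
	// 1. Verify the VF on the host is in the expected state; repair if not.
	if err := fixHostVF(c.DeviceID); err != nil {
		return err
	}
	// 2. Delegate the actual attachment to sriov-cni.
	result, err := invoke.DelegateAdd(context.TODO(), "sriov", args.StdinData, nil)
	if err != nil {
		return err
	}
	return types.PrintResult(result, c.CNIVersion)
}

func cmdCheck(args *skel.CmdArgs) error { return nil }

func cmdDel(args *skel.CmdArgs) error {
	return invoke.DelegateDel(context.TODO(), "sriov", args.StdinData, nil)
}

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "sriov wrapper (sketch)")
}
```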

@martinkennelly
Member

martinkennelly commented May 31, 2021

@blackgold

> I tried switching to the init namespace, but the device is not visible there.

Do you know why they aren't visible in the init netns? I figured that once the pod netns is deleted, the devices would return to the init netns. SR-IOV CNI could detect that the pod netns is deleted but continue on and verify that the device is in the appropriate state.

@adrianchiris adrianchiris removed the stale This issue did not have any activity nor a conclusion in the past 90 days label Jun 1, 2021
@blackgold
Author

blackgold commented Jun 5, 2021

From within the sriov-cni process, when I tried to list devices in the init ns, the devices don't show up. Only after the last cmdDel invocation finishes do the devices show up in the host ns (at least that's what I remember).
I thought it had something to do with kubelet holding a reference to the container netns.

@zshi-redhat
Collaborator

> [Screenshot: Screen Shot 2020-05-24 at 9.25.03 AM]
>
> We currently use a sriov-cleaner daemon to clean up devices that end up in a bad state on the host. Looking for suggestions on an ideal solution.

Would it be helpful to do a device health check in the device plugin (when kubelet sends the allocate request to the device plugin)? For example, if the requested device is not in the init netns (or not in the state it was in at discovery), the device plugin would report it as unhealthy to kubelet, which would then repeat the allocation with another device.
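
A hedged sketch of that check, outside of any real device-plugin code: before advertising or allocating a VF, confirm its netdev is visible in the init netns via sysfs. The sysfs layout is standard; the exact health policy (for example, also comparing the interface name against the expected VF name) is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// vfHealthy reports whether the VF at pciAddr currently exposes exactly one
// netdev in the host (init) network namespace.
func vfHealthy(pciAddr string) (bool, error) {
	netDir := filepath.Join("/sys/bus/pci/devices", pciAddr, "net")
	entries, err := os.ReadDir(netDir)
	if err != nil {
		// No net/ directory: the VF is either still inside a pod netns
		// (the failure mode in this issue) or bound to a non-netdev driver.
		return false, nil
	}
	return len(entries) == 1, nil
}

func main() {
	healthy, err := vfHealthy("0000:05:00.1") // PCI address from the logs above
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("healthy=%v\n", healthy)
}
```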

@blackgold
Author

In our use case the job uses all IB devices for training; if even one device is not healthy, the job will not run.
It would be nice to have some facility to fix the device.

@jwolfe-ns

As a follow-up, our post-cmdDel()-failure cleanup now resets the VF, which also removes all E-Switch entries related to that VF. This prevents MAC collisions in the E-Switch, as we change the MACs for bonding. Since the host namespace only sees VFs that aren't assigned out, we can 'safely' reset every VF we see without concern about its state.

We still have the race condition, though, where a released VF in a bad state might be assigned out before we clean it up.
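
A hedged sketch of the "reset the VF" cleanup described above, done via the PCI function-level reset hook in sysfs. Whether this reset is sufficient to flush E-Switch state is device- and driver-specific; treat that as an assumption, and only reset VFs that are visible in the host netns (i.e. not currently assigned to a running pod).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resetVF triggers a PCI reset of the VF at pciAddr (e.g. "0000:0c:00.1")
// by writing to its sysfs reset attribute.
func resetVF(pciAddr string) error {
	resetPath := filepath.Join("/sys/bus/pci/devices", pciAddr, "reset")
	if err := os.WriteFile(resetPath, []byte("1"), 0200); err != nil {
		return fmt.Errorf("resetting %s: %w", pciAddr, err)
	}
	return nil
}

func main() {
	if err := resetVF("0000:0c:00.1"); err != nil { // PCI address from the kubelet logs above
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```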

@YitzyD

YitzyD commented Apr 4, 2022

Any updates here? Using sriov-cni with the dhcp IPAM plugin seems to exacerbate the issue as well.

@SchSeba
Collaborator

SchSeba commented Aug 30, 2022

Hi @YitzyD @blackgold, a quick question: can you share your pod YAML?

Just to be sure, are you using terminationGracePeriodSeconds: 0?

@SchSeba
Collaborator

SchSeba commented Aug 30, 2022

Also, which container runtime are you using? I tried this with CRI-O and I am not able to reproduce the issue after using #220.

@YitzyD

YitzyD commented Nov 29, 2022

@SchSeba After investigating further, this seems to be an issue related to dockershim, and as you said, after #220 the issue does appear to be resolved.
