Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPAM failure causes Delete Loop after the introduction of OVS-Port Cleanup on IPAM failure in ADD workflow #276

Closed
joelcheison opened this issue Aug 25, 2023 · 3 comments

Comments

@joelcheison
Copy link

joelcheison commented Aug 25, 2023

Issue:
On IPAM failure while using OVS-CNI, the pod goes into a delete loop [Stuck in either ContainerCreating or Init phase].
On closer inspection, the cause is the fact that cmdDel keeps failing and the CRI retries the delete, which ends up in a loop.
There is no other way than to force delete the pod.

Logs:
Warning FailedKillPod 3m47s (x38 over 24m) kubelet error killing pod: failed to "KillPodSandbox" for "c5c138c8-3f9c-4b99-a9b2-29e8a5e19158" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox "2b7fd1a2d31f0568b8d11866b1d0274248210c4b429914a6a1ea02ab362a4d11": plugin type="multus" name="multus-cni-network" failed (delete): delegateDel: error invoking DelegateDel - "ovs": error in getting result from DelNetwork: Failed to obtain OVS port for given connection: failed to find object from table Port"

What is happening:
During CmdAdd, IPAM fails and the defer function to clean up ports is invoked which deletes the OVS port.
The sandbox will be killed and recreated. During the CmdDel, the port cleanup function tries to find the port for cleanup.
The port is not found[as it was deleted by the defer function in CmdAdd]. The issue arises at this point. The function that tries to find the port has 3 returns: port, portfound and error. The function however returns error when port is not found, it should only return portfound as false and error as nil. Error is to be returned only when the function was not able to determine the existence of the port.

There is no scenario where the " func (ovsd *OvsDriver) GetOvsPortForContIface(contIface, contNetnsPath string) (string, bool, error) " function returns ["", false,nil], to indicate port not found [source: https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/ovsdb/ovsdb.go]

The following code for CmdDel in [https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/plugin/plugin.go] expects otherwise

	portName, portFound, err := getOvsPortForContIface(ovsDriver, args.IfName, args.Netns)
	if err != nil {
		return fmt.Errorf("Failed to obtain OVS port for given connection: %v", err)
	}

Returns error above

	// Do not return an error if the port was not found, it may have been
	// already removed by someone
	if portFound {
		if err := removeOvsPort(ovsDriver, portName); err != nil {
			return err
		}
	}

Due to this behavior of "func (ovsd *OvsDriver) GetOvsPortForContIface", if the port does not exist during CmdDel[one way is the defer function in CmdAdd deleting the port], CmdDel fails. The CRI will iteratively try to Kill the sandbox, each attempt calls the CmdDel which fails and this causes a delete loop.

@phoracek
Copy link
Member

Hello @joelcheison, thanks for reporting this. CNI DEL should never fail, so the current behavior is clearly wrong. Since you were already able to pinpoint the source of the problem, would you be willing to post a fix?

@ykulazhenkov
Copy link
Contributor

Should be fixed by #305

@phoracek
Copy link
Member

Thanks @ykulazhenkov! Closing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants