Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

veth as phys gets deleted when container fails to start #4409

Open
amateur80lvl opened this issue Mar 7, 2024 · 2 comments
Open

veth as phys gets deleted when container fails to start #4409

amateur80lvl opened this issue Mar 7, 2024 · 2 comments
Milestone

Comments

@amateur80lvl
Copy link

I'm using veth to connect containers without any bridges:

ip link add veth1a type veth peer name veth1b

configuration:

# some real adapter
lxc.net.0.type = phys
lxc.net.0.name = enx
lxc.net.0.link = enx00e053e6155c

# veth as phys
lxc.net.1.type = phys
lxc.net.1.name = ethv
lxc.net.1.link = veth1a

If everything else is okay this woks like a charm. However, if container configuration is wrong, for example, add a nonsense to reproduce:

lxc.hook.mount = xxx

then veth pair gets deleted from the system.

Basically, I suppose such a deletion is a kernel feature which seems to be difficult to work around in LXC, if possible at all.
But it's very annoying to me to restore interfaces each time I make a mistake in config. Plus to restart a container that used the peer. Some simple re-arrangements could make life a bit easier and I'd appreciate to see them implemented in the upstream someday. Below are just suggestions.

The main change is to move lxc_restore_phys_nics_to_netns call to the finalization of lxc_spawn, because it is not called in case of failure and adapters aren't moved back. Physical adapters survive, virtual ones do not.

The next change is invoking lxc_network_recv_name_and_ifindex_from_child immediatelly after sending interface names to the child.
I love the look of counterpart piece after making the change:

ret = lxc_network_recv_from_parent(handler);
if (ret < 0)
        return log_error(-1, "Failed to receive veth names from parent");

ret = lxc_setup_network_in_child_namespaces(lxc_conf);
if (ret < 0)
        return log_error(-1, "Failed to setup network");

ret = lxc_network_send_name_and_ifindex_to_parent(handler);

Everything is in one place. I have a gut feeling this is exact look of LXC code in early days before things get complicated.
However, for proper sequencing the data over sockets I had to move this block before lxc_rootfs_prepare_child.
I don't know if this move has any implications, it's was the simplest way for me. You might have better ideas with sync sugar.

Also note I had to swap indexes of handler->data_sock to make it working, I have no idea why above changes broke it, must be a serious flaw in my patch. Same for ERROR vs SYSERROR and syserror vs syserror_ret vs log_error. Not sure which one is correct in particular places.

Finally, the result of if_nametoindex is not checked in lxc_delete_network_priv, that looks a bit weird, and I haven't seen this

netdev->ifindex = if_nametoindex(netdev->name);
ever returned with success in my setups.

Here's the full patch:

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 4b46d24bf..83279b5fe 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -3790,12 +3790,6 @@ int lxc_sync_fds_parent(struct lxc_handler *handler)
 	if (ret < 0)
 		return syserror_ret(ret, "Failed to receive tty info from child process");
 
-	if (container_uses_namespace(handler, CLONE_NEWNET)) {
-		ret = lxc_network_recv_name_and_ifindex_from_child(handler);
-		if (ret < 0)
-			return syserror_ret(ret, "Failed to receive names and ifindices for network devices from child");
-	}
-
 	ret = lxc_recv_console_from_child(handler);
 	if (ret < 0)
 		return syserror_ret(ret, "Failed to receive console from child");
@@ -3820,12 +3814,6 @@ int lxc_sync_fds_child(struct lxc_handler *handler)
 	if (ret < 0)
 		return syserror_ret(ret, "Failed to send tty file descriptors to parent");
 
-	if (container_uses_namespace(handler, CLONE_NEWNET)) {
-		ret = lxc_network_send_name_and_ifindex_to_parent(handler);
-		if (ret < 0)
-			return syserror_ret(ret, "Failed to send network device names and ifindices to parent");
-	}
-
 	ret = lxc_send_console_to_parent(handler);
 	if (ret < 0)
 		return syserror_ret(ret, "Failed to send console to parent");
@@ -3862,6 +3850,20 @@ int lxc_setup(struct lxc_handler *handler)
 	const char *lxcpath = handler->lxcpath, *name = handler->name;
 	struct lxc_conf *lxc_conf = handler->conf;
 
+	if (container_uses_namespace(handler, CLONE_NEWNET)) {
+		ret = lxc_network_recv_from_parent(handler);
+		if (ret < 0)
+			return log_error(-1, "Failed to receive veth names from parent");
+
+		ret = lxc_setup_network_in_child_namespaces(lxc_conf);
+		if (ret < 0)
+			return log_error(-1, "Failed to setup network");
+
+		ret = lxc_network_send_name_and_ifindex_to_parent(handler);
+		if (ret < 0)
+			return syserror_ret(ret, "Failed to send network device names and ifindices to parent");
+	}
+
 	ret = lxc_rootfs_prepare_child(handler);
 	if (ret < 0)
 		return syserror("Failed to prepare rootfs");
@@ -3882,16 +3884,6 @@ int lxc_setup(struct lxc_handler *handler)
 			return log_error(-1, "Failed to setup container keyring");
 	}
 
-	if (container_uses_namespace(handler, CLONE_NEWNET)) {
-		ret = lxc_network_recv_from_parent(handler);
-		if (ret < 0)
-			return log_error(-1, "Failed to receive veth names from parent");
-
-		ret = lxc_setup_network_in_child_namespaces(lxc_conf);
-		if (ret < 0)
-			return log_error(-1, "Failed to setup network");
-	}
-
 	if (lxc_conf->autodev > 0) {
 		ret = mount_autodev(name, &lxc_conf->rootfs, lxc_conf->autodevtmpfssize, lxcpath);
 		if (ret < 0)
diff --git a/src/lxc/network.c b/src/lxc/network.c
index 09666a76c..14c909952 100644
--- a/src/lxc/network.c
+++ b/src/lxc/network.c
@@ -3653,6 +3653,9 @@ static bool lxc_delete_network_priv(struct lxc_handler *handler)
 		 * containers network namespace, update the ifindex.
 		 */
 		netdev->ifindex = if_nametoindex(netdev->name);
+                if (0 == netdev->ifindex) {
+			WARN("Failed to update interface index for %s/%s: %s", netdev->name, netdev->link, strerror(errno));
+		}
 
 		/* Delete l2proxy entries if enabled and used with a link property */
 		if (netdev->l2proxy && !is_empty_string(netdev->link)) {
@@ -3666,6 +3669,7 @@ static bool lxc_delete_network_priv(struct lxc_handler *handler)
 			 * with their transient name to avoid collisions
 			 */
 			netdev->ifindex = if_nametoindex(netdev->transient_name);
+			// XXX if (0 == netdev->ifindex) ???
 			ret = lxc_netdev_rename_by_index(netdev->ifindex, netdev->link);
 			if (ret < 0)
 				WARN("Failed to rename interface with index %d "
@@ -4132,7 +4136,7 @@ int lxc_network_recv_from_parent(struct lxc_handler *handler)
 
 int lxc_network_send_name_and_ifindex_to_parent(struct lxc_handler *handler)
 {
-	int data_sock = handler->data_sock[0];
+	int data_sock = handler->data_sock[1];
 	struct lxc_netdev *netdev;
 	struct list_head *netdevs = &handler->conf->netdevs;
 
@@ -4165,7 +4169,7 @@ int lxc_network_send_name_and_ifindex_to_parent(struct lxc_handler *handler)
 
 int lxc_network_recv_name_and_ifindex_from_child(struct lxc_handler *handler)
 {
-	int data_sock = handler->data_sock[1];
+	int data_sock = handler->data_sock[0];
 	struct lxc_netdev *netdev;
 
 	if (!handler->am_root)
diff --git a/src/lxc/start.c b/src/lxc/start.c
index 8daa7d73c..a5e9a6d99 100644
--- a/src/lxc/start.c
+++ b/src/lxc/start.c
@@ -1865,31 +1865,36 @@ static int lxc_spawn(struct lxc_handler *handler)
 	if (!lxc_sync_wake_child(handler, START_SYNC_POST_CONFIGURE))
 		goto out_delete_net;
 
-	ret = lxc_rootfs_prepare_parent(handler);
-	if (ret) {
-		ERROR("Failed to prepare rootfs");
-		goto out_delete_net;
-	}
-
 	if (container_uses_namespace(handler, CLONE_NEWNET)) {
 		ret = lxc_network_send_to_child(handler);
 		if (ret < 0) {
 			SYSERROR("Failed to send veth names to child");
 			goto out_delete_net;
 		}
+		ret = lxc_network_recv_name_and_ifindex_from_child(handler);
+		if (ret < 0) {
+			SYSERROR("Failed to receive names and ifindices for network devices from child");
+			goto out_restore_net;  // or delete???
+		}
+	}
+
+	ret = lxc_rootfs_prepare_parent(handler);
+	if (ret) {
+		ERROR("Failed to prepare rootfs");
+		goto out_restore_net;
 	}
 
 	if (!lxc_sync_wait_child(handler, START_SYNC_IDMAPPED_MOUNTS))
-		goto out_delete_net;
+		goto out_restore_net;
 
 	ret = lxc_idmapped_mounts_parent(handler);
 	if (ret) {
 		ERROR("Failed to setup mount entries");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 
 	if (!lxc_sync_wait_child(handler, START_SYNC_CGROUP_LIMITS))
-		goto out_delete_net;
+		goto out_restore_net;
 
 	/*
 	 * With isolation the limiting devices cgroup was already setup, so
@@ -1898,13 +1903,13 @@ static int lxc_spawn(struct lxc_handler *handler)
 	if (!handler->conf->cgroup_meta.namespace_dir &&
 	    !cgroup_ops->setup_limits_legacy(cgroup_ops, handler->conf, true)) {
 		ERROR("Failed to setup legacy device cgroup controller limits");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 	TRACE("Set up legacy device cgroup controller limits");
 
 	if (!cgroup_ops->devices_activate(cgroup_ops, handler)) {
 		ERROR("Failed to setup cgroup2 device controller limits");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 	TRACE("Set up cgroup2 device controller limits");
 
@@ -1915,11 +1920,11 @@ static int lxc_spawn(struct lxc_handler *handler)
 	ret = run_lxc_hooks(name, "start-host", conf, NULL);
 	if (ret < 0) {
 		ERROR("Failed to run lxc.hook.start-host");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 
 	if (!lxc_sync_wake_child(handler, START_SYNC_FDS))
-		goto out_delete_net;
+		goto out_restore_net;
 
 	if (handler->ns_unshare_flags & CLONE_NEWCGROUP) {
 		/* Now we're ready to preserve the cgroup namespace */
@@ -1927,7 +1932,7 @@ static int lxc_spawn(struct lxc_handler *handler)
 		if (ret < 0) {
 			if (ret != -ENOENT) {
 				SYSERROR("Failed to preserve cgroup namespace");
-				goto out_delete_net;
+				goto out_restore_net;
 			}
 		}
 	}
@@ -1938,7 +1943,7 @@ static int lxc_spawn(struct lxc_handler *handler)
 		if (ret < 0) {
 			if (ret != -ENOENT) {
 				SYSERROR("Failed to preserve time namespace");
-				goto out_delete_net;
+				goto out_restore_net;
 			}
 		}
 	}
@@ -1946,13 +1951,13 @@ static int lxc_spawn(struct lxc_handler *handler)
 	ret = lxc_sync_fds_parent(handler);
 	if (ret < 0) {
 		SYSERROR("Failed to sync file descriptors with child");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 
 	ret = lxc_terminal_setup(conf);
 	if (ret < 0) {
 		SYSERROR("Failed to create console");
-		goto out_delete_net;
+		goto out_restore_net;
 	}
 
 	/*
@@ -1963,7 +1968,7 @@ static int lxc_spawn(struct lxc_handler *handler)
 	 * different value, causing us to error out).
 	 */
 	if (!lxc_sync_barrier_child(handler, START_SYNC_READY_START))
-		goto out_delete_net;
+		goto out_restore_net;
 
 	/* Now all networks are created, network devices are moved into place,
 	 * and the correct names and ifindices in the respective namespaces have
@@ -1986,6 +1991,9 @@ static int lxc_spawn(struct lxc_handler *handler)
 
 	return 0;
 
+out_restore_net:
+	lxc_restore_phys_nics_to_netns(handler);
+
 out_delete_net:
 	if (container_uses_namespace(handler, CLONE_NEWNET))
 		lxc_delete_network(handler);
@@ -2183,7 +2191,6 @@ out_detach_blockdev:
 
 out_delete_network:
 	lxc_abort(handler);
-	lxc_restore_phys_nics_to_netns(handler);
 	lxc_delete_network(handler);
 	detach_block_device(handler->conf);
 	lxc_end(handler);
@stgraber stgraber added this to the lxc-6.0.0 milestone Mar 26, 2024
@mihalicyn
Copy link
Member

Hi @amateur80lvl !

  1. Have you tried to use lxc.net.X.type = veth ? In this case LXC will create veth-pair for you. You can instruct LXC to add this veth device in the bridge (use something like lxc.net.X.link = br0).

  2. your veth pair is getting removed because veth device can only exist when it has a pair. If one of devices gets destroyed then second device is destroyed too (https://github.com/torvalds/linux/blob/026e680b0a08a62b1d948e5a8ca78700bfac0e6e/drivers/net/veth.c#L1916)

@stgraber stgraber modified the milestones: lxc-6.0, lxc-6.1 Apr 2, 2024
@amateur80lvl
Copy link
Author

Hi @mihalicyn,

thanks for the help, but

1. Have you tried to use `lxc.net.X.type = veth` ?

Honestly, I haven't. If I went this way one container would be the creator of veth pair and the other would use it. Too weird dependency to easily handle especially with many containers. What if I need to restart the creator and don't want peers to notice this? Pre-created pairs make things easier.

2. your veth pair is getting removed because veth device can only exist when it has a pair.

Basically that's not a problem, The problem is they gets deleted by lxc-start when they should not.

I feel myself stumbled upon an undocumented (unless I missed the doc, that's quite possible) interprocess protocol which is fragile and my attempt to make changes led to weird things such as the need for swapping socket pairs. If I worked at work I'd be forced to review, write documentation and/or comments, refactor the code, and then try fixing the problem. I tried the latter in the first place and this only spotted the problem. My personal workaround for now is not to make mistakes in the configuration files when I use veth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants