[Experiment] Explore running multiple containers in a shared VM #3658

Draft · wants to merge 18 commits into base: master
Conversation

@eriknordmark (Contributor) commented Dec 9, 2023

For sidecar containers it would be useful to be able to run them in the same VM as the main container.
This is an experiment to see whether that can be done without any API changes by looking for multiple OCI-based volumes for a single app instance, and kicking off the EntryPoint for each one of them.

If this works it might be a useful stepping stone towards a more complete standard runtime for multiple containers in one VM.

With this PR I can create an app instance which has two OCI images (the example used the unmodified nginx and sshd containers from docker.io). The sshd and nginx containers run with chroot isolation and otherwise share everything, which matches the intended use case of close cooperation and trust between a sidecar container and the main container.

Note that the commits in this PR need to be cleaned up and squashed.
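
To make the mechanism concrete, here is a hypothetical Go sketch of the idea described above: inside the VM, walk the per-OCI mount points (the commits add support for /mnt%d per OCI) and start each image's EntryPoint with chroot-only isolation. The actual implementation lives in the pkg/xen-tools init-initrd scripts; the /mnt layout and the /entrypoint.sh placeholder below are illustrative assumptions, not the PR's code.

```go
// Hypothetical sketch (NOT the PR's actual code): start the EntryPoint of each
// OCI rootfs in one VM with chroot-only isolation. Assumes the first OCI
// volume is mounted at /mnt and later ones at /mnt1, /mnt2, ...;
// "/entrypoint.sh" stands in for the EntryPoint taken from each image's
// OCI config.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	var cmds []*exec.Cmd
	for i := 0; ; i++ {
		root := "/mnt"
		if i > 0 {
			root = fmt.Sprintf("/mnt%d", i)
		}
		if _, err := os.Stat(root + "/rootfs"); err != nil {
			break // no more OCI volumes for this app instance
		}
		cmd := exec.Command("/entrypoint.sh")
		// chroot into this container's rootfs; everything else is shared.
		cmd.SysProcAttr = &syscall.SysProcAttr{Chroot: root + "/rootfs"}
		cmd.Stdout = os.Stdout // shared console, no per-container mux
		cmd.Stderr = os.Stderr
		if err := cmd.Start(); err != nil {
			log.Fatalf("container %d: %v", i, err)
		}
		cmds = append(cmds, cmd)
	}
	for _, c := range cmds {
		_ = c.Wait()
	}
}
```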

codecov bot commented Dec 9, 2023

Codecov Report

Attention: 109 lines in your changes are missing coverage. Please review.

Comparison is base (3db87c6) 19.86% compared to head (6b09142) 19.87%.

| Files | Patch % | Lines |
|---|---|---|
| pkg/pillar/hypervisor/xen.go | 0.00% | 45 Missing ⚠️ |
| pkg/pillar/hypervisor/kvm.go | 34.04% | 31 Missing ⚠️ |
| pkg/pillar/cmd/domainmgr/domainmgr.go | 0.00% | 26 Missing ⚠️ |
| pkg/pillar/hypervisor/containerd.go | 0.00% | 6 Missing ⚠️ |
| pkg/pillar/containerd/oci.go | 90.90% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3658   +/-   ##
=======================================
  Coverage   19.86%   19.87%           
=======================================
  Files         231      231           
  Lines       51063    51160   +97     
=======================================
+ Hits        10143    10167   +24     
- Misses      40179    40253   +74     
+ Partials      741      740    -1     


@deitch (Contributor) commented Dec 11, 2023

> This is an experiment to see whether that can be done without any API changes by looking for multiple OCI-based volumes for a single app instance, and kicking off the EntryPoint for each one of them

How did you manage this? I see you switched from the assumption that a DomainStatus is a single container, with OCIConfigDir, to a list of containers ContainerList []DomainContainerStatus, each of which has OCIConfigDir, ContainerIndex and FileLocation.

What I don't get is:

  • how did you manage to populate that without changing the AppInstanceConfig, which has a single VmConfig? I guess it does have a repeated Drive, but you would still need the entrypoint for each, as well as other info?
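
For orientation, a rough sketch of the type change deitch is describing, reconstructed only from the field names in this comment; the real pkg/pillar/types definitions carry many more fields and the exact Go types here are assumptions.

```go
// Reconstructed from the field names mentioned in this comment; types are
// assumptions, not the actual pkg/pillar/types definitions.
package sketch

// DomainContainerStatus describes one OCI container inside a shared VM.
type DomainContainerStatus struct {
	OCIConfigDir   string // where this image's OCI config was unpacked
	ContainerIndex int    // position among the app instance's OCI volumes
	FileLocation   string // path to this container's rootfs/volume
}

// DomainStatus previously assumed a single container (one OCIConfigDir);
// with this PR it carries a list of containers instead.
type DomainStatus struct {
	// ... existing fields elided ...
	ContainerList []DomainContainerStatus
}
```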

@deitch (Contributor) commented Dec 11, 2023

Separately, what is the effort of this vs seeing if kata or k3s would be less effort?

@uncleDecart (Contributor) commented Dec 11, 2023

> Separately, what is the effort of this vs seeing if kata or k3s would be less effort?

@deitch from what I saw, kata supports KVM, but I didn't see support for other type 1 hypervisors (like Xen). Even if we accept something that is only supported on KVM, kata creates a VM instance with which it communicates via virtio, so we would still need at least one VM to run the containers.
An alternative which I also explored was unikernels like Unikraft. In that case you need to support a pipeline for building that specific VM and worry about porting libraries to it when you need them. Fair point that for a sidecar container we don't need much, but we would still have to maintain that. I believe the smallest-footprint way to run more containers is to run them in one VM; the others are good options, but will require significantly more time to research and develop.

Edit: there are also microVMs in KVM, which might reduce the footprint.

```sh
mount -t tmpfs -o nodev,nosuid,noexec,size=20% shm "$MNT"/rootfs/dev/shm
mount -t tmpfs -o nodev,nosuid,size=20% tmp "$MNT"/rootfs/tmp
mount -t mqueue -o nodev,nosuid,noexec none "$MNT"/rootfs/dev/mqueue
ln -s /proc/self/fd "$MNT"/rootfs/dev/fd
```
A Collaborator commented:

Wouldn't sharing all of these descriptors with all containers mess things up? Without a mux we will have mixed output on stdout, which might not be critical for now, but what about stdin? Do we care about it?
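
As a purely illustrative aside (not something this PR does), a stdout "mux" could be as simple as a per-container writer that labels each line before it reaches the shared console; a minimal Go sketch, with the "nginx" label as a made-up example:

```go
// Illustrative only: label each container's output lines before they reach
// the shared console.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// prefixWriter returns a writer that copies whole lines to dst, each prefixed
// with label; the returned channel is closed once all output has been flushed.
func prefixWriter(dst io.Writer, label string) (io.WriteCloser, <-chan struct{}) {
	pr, pw := io.Pipe()
	done := make(chan struct{})
	go func() {
		defer close(done)
		sc := bufio.NewScanner(pr)
		for sc.Scan() {
			fmt.Fprintf(dst, "[%s] %s\n", label, sc.Text())
		}
	}()
	return pw, done
}

func main() {
	w, done := prefixWriter(os.Stdout, "nginx")
	fmt.Fprintln(w, "hello from one of the containers")
	w.Close()
	<-done
}
```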

@eriknordmark (Contributor Author) replied:

The use case for this experiment is when some sharing is OK, but I'll look at the list and see what makes sense to separate.
If the direction in this PR proves useful (with its limitations) it might be a stepping stone to running a collection of containers (a pod) in a VM using something existing like kata or k3s.

@deitch (Contributor) commented Dec 11, 2023

> kata creates a VM instance with which it communicates via virtio, so we would still need at least one VM to run the containers.

If we have no choice, then OK. I am just so wary of yet again creating something that looks a lot like some other OSS project or library, but done just a little bit differently.

github-actions bot requested a review from rene, December 12, 2023 00:20
@eriknordmark (Contributor Author) commented:
> how did you manage to populate that without changing the AppInstanceConfig, which has a single VmConfig? I guess it does have a repeated Drive, but you would still need the entrypoint for each, as well as other info?

The current API allows specifying any number of virtual disks, whether they are OCI images or plain disk images.
We currently don't do anything with the EntryPoint etc. in anything but the first OCI image.
This experiment uses that.
So the single VmConfig still has to specify the CPU, memory, adapters, direct attach, etc.
But each OCI image has its own environment, user/group, etc.
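
A hedged sketch of what that could look like on the config-processing side: every OCI-typed volume in the existing drive list contributes its own entrypoint, environment and user from its image config, while the single VmConfig still sizes the VM. The type and field names below are illustrative assumptions, not the actual pkg/pillar definitions.

```go
// Illustrative types and names only, not the actual pkg/pillar definitions.
package sketch

// ociImageConfig holds the per-image settings that differ between containers.
type ociImageConfig struct {
	Entrypoint []string
	Env        []string
	User       string
}

// diskConfig represents one entry of the app instance's existing drive list.
type diskConfig struct {
	IsOCI     bool
	ConfigDir string // where the OCI image's config was unpacked
}

// containersForApp turns every OCI-typed volume into a container description.
// Previously only the first OCI image's EntryPoint was used; plain disk images
// still contribute storage only.
func containersForApp(disks []diskConfig,
	load func(dir string) ociImageConfig) []ociImageConfig {
	var out []ociImageConfig
	for _, d := range disks {
		if !d.IsOCI {
			continue
		}
		out = append(out, load(d.ConfigDir))
	}
	return out
}
```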

@deitch (Contributor) commented Dec 12, 2023

> The current API allows specifying any number of virtual disks, whether they are OCI images or plain disk images.
> We currently don't do anything with the EntryPoint etc. in anything but the first OCI image.
> This experiment uses that.
> So the single VmConfig still has to specify the CPU, memory, adapters, direct attach, etc.
> But each OCI image has its own environment, user/group, etc.

That is how you did it. Now I see it inside. That is rather nicely done. It feels a bit like swimming against the stream - Kubernetes has a native "multiple containers together" (i.e. Pod) concept - but we do as we need.

@eriknordmark (Contributor Author) commented:
> > The current API allows specifying any number of virtual disks, whether they are OCI images or plain disk images.
> > We currently don't do anything with the EntryPoint etc. in anything but the first OCI image.
> > This experiment uses that.
> > So the single VmConfig still has to specify the CPU, memory, adapters, direct attach, etc.
> > But each OCI image has its own environment, user/group, etc.
>
> That is how you did it. Now I see it inside. That is rather nicely done. It feels a bit like swimming against the stream - Kubernetes has a native "multiple containers together" (i.e. Pod) concept - but we do as we need.

From an implementation perspective I expect this to go away (together with the rest of the init-initrd scripts in pkg/xen-tools) once we find and integrate a standard runtime for all of this. And that should presumably give us the ability to specify e.g., volume and network resources for the containers inside the pod VM. So a stepping stone from a functional perspective, and a limited amount of throw-away code.

Add support for /mnt%d per OCI

Signed-off-by: eriknordmark <erik@zededa.com>
@shjala (Contributor) commented Jan 29, 2024

I have some security concerns about running multiple apps in one VM without mapping each container to a separate user, and about the fact that we are not setting up namespaces such as mount namespaces. I'll write a longer comment soon.

@eriknordmark (Contributor Author) commented:
> I have some security concerns about running multiple apps in one VM without mapping each container to a separate user, and about the fact that we are not setting up namespaces such as mount namespaces. I'll write a longer comment soon.

@shjala The assumption is that this will only be used by sidecar containers which do need full access, hence no namespace isolation from the main container. I don't know whether we can enforce that by checking the content of the container - I need to look at the test case we used.
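
For context on the concern above, here is what per-container separation could look like expressed with the standard OCI runtime-spec types (a hypothetical sketch, not code from this PR): each container gets its own UID/GID and a private mount namespace, exactly the kind of isolation the shared-VM chroot approach currently skips. The UID/GID 1001 and the sshd command are made-up example values.

```go
// Hypothetical sketch, not code from this PR: an OCI runtime spec giving a
// sidecar its own user and a private mount namespace.
package sketch

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// sidecarSpec shows the kind of per-container separation being asked about,
// expressed with the standard runtime-spec types.
func sidecarSpec() specs.Spec {
	return specs.Spec{
		Process: &specs.Process{
			User: specs.User{UID: 1001, GID: 1001}, // dedicated, unprivileged user
			Args: []string{"/usr/sbin/sshd", "-D"},
		},
		Linux: &specs.Linux{
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.MountNamespace}, // private mount namespace
			},
		},
	}
}
```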
