What does "Pod Sandbox" mean to Aurae? #433

Open

krisnova opened this issue Feb 26, 2023 · 12 comments

@krisnova (Contributor)
How did we get here?

So I was streaming recently and started to look through the implementation details of how we are implementing the container runtime interface, CRI.

Naturally this opened up a can of worms. One implementation detail led to another, and it quickly spiraled out of control. This left me spending the weekend thinking to myself about what the project should do. I want this GitHub issue to serve as an architecture decision record (ADR) for what we intend to do.

But first, some context, history, and vocabulary.

What is a "Pod Sandbox" and where did it come from?

Here is the shape I picture when I think of what most of the industry refers to as "a pod":

pod/
├── container-app
│   └── bin
│       └── my-app
├── container-log-aggregate
│   └── app
│       ├── logger.go
│       └── main.go
├── container-profiler
│   └── program.exe
└── container-proxy
    └── bin
        └── nginx

which is basically to say that it's a bounded set of containers that exist within some isolation zone. Kubernetes, for example, likes to pretend that the containers within a pod all share the same localhost, storage, network, etc.

In the context of OpenShift sandboxed containers, a pod is implemented as a virtual machine. Several containers can run in the same pod on the same virtual machine.[1]

The history of a Pod (as I understand it) is relatively simple, and makes sense given the behavior of the clone(2) and clone3(2) system calls. Basically you cannot "create" a new namespace in Linux. You can, however, execute a new process in a new namespace. So what do you do when you just want an "empty" boundary and aren't ready to start any work in your namespace yet? Or more importantly, how do you keep the namespace around if your container exits? Linux will destroy namespaces if there is no longer something executing in the namespace.

There is some historical context suggesting that the Kubernetes Pause Container was the answer to this problem.

  1. A user executes clone(2) or clone3(2) with a new Pause process.
  2. The new process gets a pid and a shiny new set of namespaces, and basically just falls asleep and does nothing.
  3. Now that the namespaces are established, we can schedule and reschedule other processes alongside each other in the new namespaces.

Thus, the paradigm of the pod sandbox was created as a way to hold a set of these containers together.
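For anyone who wants to poke at this outside of Aurae, the same pattern can be reproduced with stock util-linux tools. This is just a rough sketch (nothing here is Aurae code):

sudo unshare --pid --uts --ipc --net --mount --fork sleep infinity &  # the "pause" process: does nothing, just keeps the namespaces alive
sleep 1 && PAUSE_PID=$(pgrep -n -x sleep)                             # grab the pid of the parked sleep process
sudo nsenter --target "$PAUSE_PID" --uts --ipc --net --mount ip addr  # join those namespaces later; only the fresh loopback shows up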


Option 1) A Pod is a VM

This is a straightforward proposal and can be a viable and powerful path for Aurae to adopt.

Basically we follow suit with OpenShift, gVisor, and Firecracker and establish a virtualization zone (basically a VM) for every pod by default.

Once the VM has been started we can delegate out to the nested auraed to run a container using our own RPC. The containers can share the same namespaces as the host, and we can mount volumes between them, communicate over the local network, etc. We can bake in more logic (such as network devices) in the future as well.

Implementation would look like:

  1. We finalize our decision on VM software (I think I am leaning towards KVM) and schedule an auraed VM.
  2. We connect to the nested auraed over the network, and schedule a container using a new RPC such as RunContainer() (see the sketch after this list).
  3. We persist the VM regardless of workload, and containers become mutable. The user destroys the VM when they are done.
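To make step 2 concrete, the call might look something like the sketch below. The endpoint, proto package, and request shape are all placeholders (only the RunContainer name comes from this issue), and a real auraed connection would be mTLS rather than plaintext:

grpcurl -plaintext \
  -d '{"name": "nginx", "image": "docker.io/library/nginx:latest"}' \
  10.0.0.2:8080 aurae.runtime.v0.RuntimeService/RunContainer  # hypothetical nested-auraed address and service name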

Option 2) A pod is a container, and we spritz up our cells

In this option we would need to do 2 things.

  1. Schedule a pod as a plain-ol-container.
  2. Establish the ability to "install" a tarball into the container filesystem using a new feature in the cells service.

This option is attractive because it addresses the package management and supply chain concerns: everything becomes a tarball/OCI image at the end of the day.

Basically we would create a new Youki container with a nested auraed running as an init process. Then we can access the auraed RPC for cells, and send an OCI image to the cell service to un-tar the image and "install" it as we would with any package manager. This kind of violates the entire supply chain guarantee and image immutability thing that everyone seems to love about containers, so I am not sure this is a good approach. However, this also feels a lot more intuitive to anyone used to systemd and bare metal machines.

This approach would involve a new RPC for the cell service that allows the user to pass a remote URL for an OCI image/tarball for the cell service to download and install. The cells would be created inside the container, and they could just do what they needed.
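Mechanically the "install" is not much more than the sketch below (all paths, URLs, and the entrypoint are made up for illustration; in practice this would happen inside the pod container via the new RPC):

mkdir -p /var/run/aurae/cells/mycell/rootfs                      # hypothetical per-cell root inside the pod container
curl -L -o /tmp/app.tar.gz https://example.com/app.tar.gz        # the remote tarball/OCI image the RPC would point at
tar -C /var/run/aurae/cells/mycell/rootfs -xzf /tmp/app.tar.gz   # "install" is just unpacking into the cell filesystem
chroot /var/run/aurae/cells/mycell/rootfs /bin/my-app            # chroot so two cells don't share the same paths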

One thing to figure out would be the need to chroot each cell filesystem, as otherwise we have no way of preventing 2 "containers" from sharing the same files/paths/directory structure. The fact that we would need to chroot each cell (that the user would be calling a container) is a red flag.

Option 3) A pod is a container, and your containers are also containers

Basically we create a new auraed container when a user creates a new pod sandbox. We establish new namespaces for the new container. When it comes time to schedule a nested container inside the new pod sandbox we call out to the nested auraed and say "RunContainer" but just use the namespaces from the first pod.

The output here would be a node with a LOT of containers floating around, all with a "virtual" structure. In other words we would have a flat list of containers from the host's perspective, and the structure and isolation are enforced only by how we expose namespaces to containers.
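In OCI terms each namespace entry in a container's config.json can point at a path such as /proc/<pid>/ns/net instead of creating a fresh one, which is how the nested container would "just use the namespaces from the first pod". A rough shell equivalent of the same trick (the container name and binary are illustrative):

SANDBOX_PID=$(sudo -E youki state pod-sandbox | jq -r .pid)      # pid 1 of the sandbox container
sudo nsenter --target "$SANDBOX_PID" --net --ipc --uts \
  unshare --pid --mount --fork /bin/my-app                       # share net/ipc/uts with the sandbox, keep fresh pid/mount namespaces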

This feels... wrong... I can't explain why... I believe this is how a lot of container runtimes do things now, and it just seems to be an anti-pattern given that we could actually build recursive isolation boundaries.

The Decision

I want some help talking through the decision -- We can close the issue once we have come to conviction internally.

@jpetazzo commented Feb 26, 2023

Hi,

Just to add a tiny detail, in case that's helpful, about this:

how do you keep the namespace around if your container exits? Linux will destroy namespaces if there is no longer something executing in the namespace

You can "save" a namespace by bind-mounting its pseudo-file (the thing found in /proc/<PID>/ns).

For instance, let's create a namespace and configure a loopback interface in it, with a specific IP address so we can identify it later:

# unshare --net
# ifconfig lo 127.42
# echo $$ > /tmp/mypid
# cat /tmp/mypid 
28171

Now, "save" the namespace by bind-mounting it:

# touch /tmp/mynetns
# mount -o bind /proc/$$/ns/net /tmp/mynetns

Leave the namespace, check that we're "outside" and that the process that we created is gone:

# exit
logout
# ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
[...]
# kill -CONT $(cat /tmp/mypid)
-bash: kill: (28171) - No such process

Then re-enter that namespace thanks to the bind-mount of the pseudo-file:

# nsenter --net=/tmp/mynetns
# ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.42  netmask 255.0.0.0
[...]

I don't know for sure why Kubernetes is using the "pod sandbox" concept. Perhaps because the Docker API doesn't expose anything to manipulate these network namespaces directly, and these pause containers were as good as anything else to do that. And/or perhaps the pause containers are a place as good as any to do zombie reaping (when sharing the PID namespace).

I don't have a particularly strong opinion on your question, though (I'm just watching the space from the sidelines with excitement :))

Edit: I was mentioning the bind-mount possibility just in case it unlocks interesting options in your scenario, i.e. the possibility of preserving a network namespace on its own, without requiring a process, which itself might require a Cell or leak some other abstraction.

@gabriel-samfira

You probably already know about this, but if you plan to go the VM route, https://github.com/rust-vmm might be useful.

@krisnova (Contributor, Author)

Thank you @jpetazzo and @gabriel-samfira! This is great feedback!

@jpetazzo this is useful context, I had no idea you could "park" a namespace just by bind mounting the pseudo namespace file. This will surely come in handy one day 😉 and I also suspect this will be one of those issues that serves as hidden knowledge that folks will discover while looking for examples of how to save a namespace. The only way we will know how many people find this useful in the coming years will be by folks leaving emojis on the thread to show us.

As far as the decision goes I am pretty convinced on Option 1 and will be looking at this more on Twitch today.

Maybe a better set of questions:

  • Should we have a "default" and a "fallback" mode? I doubt we can guarantee that the state of the system will always be "Sure bruh.. go ahead and kick off a virtual machine" for every pod. Maybe we default to Option 1) and fall back to Option 3)? (See the sketch after this list.)
  • What features are we optimizing for? I always wanted Aurae to be boring and secure, and honestly if that is the case then it means a lightweight VM pod with a set of happy containers running inside is the way to go.
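For the default/fallback question, checking whether Option 1 is even possible on a given node is cheap; a minimal sketch:

grep -c -E 'vmx|svm' /proc/cpuinfo   # non-zero when the CPU advertises hardware virtualization
test -c /dev/kvm && echo "kvm present: default to Option 1" || echo "no kvm: fall back to Option 3"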

@krisnova (Contributor, Author) commented Feb 26, 2023

Okay so we are going to perform a small experiment to validate my theory that we can run Option 1) as a default and fall back to Option 3).

Hypothesis

I believe it should be possible for 2 containers to share a namespace (specifically a network namespace) with Youki without making any changes to the code.

I also believe we should be able to re-create the "Kubernetes Pod" experience directly in a VM with lightweight containers and some basic understanding of how chroot works.

@krisnova (Contributor, Author)

Results of the experiment

I was able to run an nginx container with the youki runtime.

Raw config.json for youki:

{ "ociVersion": "1.0.2-dev", "root": { "path": "rootfs", "readonly": false }, "mounts": [ { "destination": "/var/log", "type": "bind", "source": "/var/log", "options": [ "rbind", "rw" ] }, { "destination": "/tmp", "type": "tmpfs", "source": "tmpfs" }, { "destination": "/proc", "type": "proc", "source": "proc" }, { "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ] }, { "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" ] }, { "destination": "/dev/shm", "type": "tmpfs", "source": "shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" ] }, { "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "ro" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ] } ], "process": { "terminal": false, "user": { "uid": 0, "gid": 0 }, "args": [ "nginx", "-g", "daemon off;" ], "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm" ], "cwd": "/", "capabilities": { "bounding": [ "CAP_SETUID", "CAP_SETGID", "CAP_CHOWN", "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "effective": [ "CAP_SETUID", "CAP_SETGID", "CAP_CHOWN", "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "inheritable": [ "CAP_SETUID", "CAP_CHOWN", "CAP_SETGID", "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "permitted": [ "CAP_SETUID", "CAP_SETGID", "CAP_CHOWN", "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "ambient": [ "CAP_SETUID", "CAP_SETGID", "CAP_CHOWN", "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ] }, "rlimits": [ { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 } ], "noNewPrivileges": true }, "hostname": "nginx", "annotations": {}, "linux": { "resources": { "devices": [ { "allow": true, "type": null, "major": null, "minor": null, "access": "rwm" } ] }, "namespaces": [ { "type": "pid" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" } ], "maskedPaths": [ "/proc/acpi", "/proc/asound", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/sys/firmware", "/proc/scsi" ], "readonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } }

In this experiment I was able to share the network namespace with the host by omitting the network namespace entry from linux.namespaces in the config.json.
Additionally I was able to bind mount the /var/log directory with the host.
Additionally I was able to mutate the bundle filesystem directly and see the updated content via the default nginx page.

Procedure

Run the nginx container with youki, and verify the experiment.

mkdir nginx
cd nginx
# Copy the config.json above to here
mkdir rootfs
sudo -E docker create --name nginx nginx
sudo -E docker export nginx | sudo -E tar -C rootfs -xf -
sudo -E youki run -b . nginx
netstat -tlpn | grep ":80" # Verify that nginx is listening in the host network namespace
tail -f /var/log/nginx/*  # Verify that the bind mount is working with the host
emacs rootfs/usr/share/nginx/html/index.html # Edit the nginx hello world and view localhost:80

Then clean up with youki:

sudo -E youki kill nginx SIGKILL
sudo -E youki delete nginx

@krisnova (Contributor, Author)

The result of the experiment has me convinced of our approach moving forward.

Decision

I am making the decision to pursue Option 1. All Aurae Pod Sandboxes will run as a lightweight virtual machine with auraed as pid 1. All containers in the sandbox will have the following characteristics by default:

  • All guest containers in a pod sandbox will share the same network namespace as the sandbox auraed pid 1.
  • All guest containers in a pod sandbox will be free to expose Linux capabilities that will only impact their relationship with the sandbox (not with the "true" host).
  • All guest containers will unshare the pid, ipc, uts, mount namespaces from the sandbox.
  • All guest containers will follow the "normal" Aurae parlance, and create accessible bundles in /var/run/aurae/bundles or wherever the daemon is configured.

In the event that virtualization is not available we fall back to the "flat container" model described in Option 3.

Implications

Each pod gets its own kernel.

Each pod gets its own set of network devices.

Each pod gets its own guest auraed running as pid 1.

@MalteJ (Contributor) commented Feb 28, 2023

All guest containers in a pod sandbox will share the same network namespace as the sandbox auraed pid 1.

What's the reasoning behind this? Do you want to enable applications to interact with auraed?
We will probably lose one (well-known) port to the auraed, which then cannot be used by the application.

The alternative would be to create a separate namespace within the VM for the container pod. And then auraed would need to take care to route the traffic from the container pod using eBPF or something similar. Maybe this would provide more flexibility?

@krisnova (Contributor, Author) commented Feb 28, 2023

What's the reasoning behind this? Do you want to enable applications to interact with auraed?

The main motivator is that this is what Kubernetes does today. See the CC-BY-SA licensed diagram here:

[diagram: Kubernetes pod networking, from the Kubernetes documentation]

and the referenced documentation for pod networking.

Every container in a Pod shares the network namespace, including the IP address and network ports. Inside a Pod (and only then), the containers that belong to the Pod can communicate with one another using localhost.

In my experience the network namespace is the "big one" that really matters for a pod. Kubernetes has always maintained that a pod should share local storage and local network. Container volumes make the storage discussion pretty simple, as containers in a pod just mount shared volumes, but the network namespace sharing is key for pods to be able to do things like run sidecars.

Should an application interact with the Aurae daemon?

As far as applications interacting with Auraed I think the answer is yes.

I think it's too early in the project to say exactly what an application will use Auraed for specifically. However, I know enough about infrastructure, sidecars, and platforms to know that most app teams will want the basics (secrets, service discovery, etc). I think Aurae attempts to simplify a lot of these discussions by bringing small features into scope.

As far as ports being eaten up in the same network namespace, yes, that will be a consequence, and it is exactly why we have the TCP port situation we do today in Kube, with pods needing to manage ports in some cases. I think this is the right thing to do, however I'm open to having my mind changed.

Do we want to use eBPF to bridge the network across the namespace?

Now networking on the pod with eBPF -- while exciting -- I think it's wrong to add too much magic there unless absolutely critically necessary. The whole point of Aurae is to be secure and boring, and sharing a network namespace in a simple and boring way without having to manage nested eBPF probes seems like the way to go. I am very traumatized by Kubernetes CNI and I don't want to go down the path of making the world more complex in exchange for some flexibility. I think a much more interesting conversation to have is to admit that the pod sandbox boundary is the network boundary in a pod, and start talking about how to map Linux network devices to the pod.
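For example, at the host level "mapping a device to the pod" could be as boring as handing the hypervisor a tap device (device and bridge names below are made up, and br0 is assumed to already exist):

sudo ip tuntap add dev tap-pod0 mode tap   # tap device the pod's micro VM will attach to
sudo ip link set tap-pod0 master br0 up    # hang it off an existing host bridge
# ...then pass tap-pod0 to the hypervisor as the guest's virtio-net device.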

@MalteJ (Contributor) commented Feb 28, 2023

I didn't want to create a new network namespace for every container, but create one network namespace within the Micro VM for the pod's containers. Sorry, I used "container" instead of "pod" in my last message.
This way the Pod really looks like the Pod you know from Kubernetes. The control plane within the VM (auraed) gets its own IP address, while the application within the containers also has its own network namespace. You could e.g. provide a v4-only network to the application, while aurae runs IPv6.
[diagram: proposed network namespace layout inside the micro VM]
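Roughly, inside the micro VM (as root) this would look like the following; addresses and interface names are illustrative only:

ip netns add pod                                      # namespace the pod's containers will share
ip link add veth-aurae type veth peer name veth-pod   # veth pair between auraed and the pod netns
ip link set veth-pod netns pod
ip addr add 2001:db8:a:b::1/64 dev veth-aurae         # auraed side
ip link set veth-aurae up
ip netns exec pod ip addr add 2001:db8:a:b::2/64 dev veth-pod
ip netns exec pod ip link set veth-pod up
ip netns exec pod ip link set lo up
ip netns exec pod ip -6 route add default via 2001:db8:a:b::1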

@MalteJ (Contributor) commented Feb 28, 2023

You could reach auraed from within the Pod via e.g. fe80::1 or 169.254.169.254

@krisnova (Contributor, Author) commented Mar 1, 2023

I didn't want to create a new network namespace for every container, but create one network namespace within the Micro VM

This is an interesting topology. I am not necessarily opposed to doing something like this by default, however I do have a few questions.

  1. What are we gaining by isolating the container network inside of the pod sandbox?
  2. Does the nested auraed running in the micro VM always exist on the same "flat" network as the root auraed?

Basically what I am fishing for is a supporting argument for the extra complexity of maintaining a container network namespace. Like I mentioned, I am still very traumatized by the CNI discussions, and my intuition is telling me to keep things flat/simple and focus on network devices rather than complex synthetic overlay networks. I understand these overlay networks are possible, I just know they introduce a lot of complexity, risk, and overhead from a performance perspective. I want to simplify things. I want Aurae to be secure, and boring.

Maybe a better way of framing what I am asking:

Is it reasonable to have the root auraed, the nested auraed, and the containers all on the same flat network such as 2001:db8:a:b::1 in your diagram? Are there any strong objections to this? Is this "flat" model a security concern?

@MalteJ (Contributor) commented Mar 1, 2023

You can have a flat network model if you address the nested auraed by link-local addresses only. Still, you'd have the network namespace for the pod, but it's directly routed into/out of the VM without an overlay or VLAN. So if the nested auraed doesn't need to be reachable from anything but the host auraed, we can use link-local IPv6 addresses, while routing the pod network into the VM.
Not sure if that's enough to make clear what I mean. I'm happy to explain in more detail if needed.
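To make the link-local idea a bit more concrete (reusing the illustrative interface names from the sketches above; fe80::1 is just a convention, not anything Aurae defines today):

ip addr add fe80::1/64 dev veth-aurae            # inside the VM: pin a well-known link-local address on the auraed side
ip -6 route add 2001:db8:a:b::/64 dev tap-pod0   # on the host: route the pod prefix straight at the VM, no overlay or VLAN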

I understand you don't like CNIs. But how do you feel about sidecars? By not separating auraed from customers' containers you effectively inject an aurae sidecar into every Pod.
