Sandboxing notes

Brainstorming:

Isolation:

namespaces (see separate namespaces notes in wiki)
capabilities
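- Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be enabled and disabled independently
- capabilities are scoped to a user_namespace, so a process can hold them inside its own user namespace without holding them on the host (see namespace notes)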
cgroups
- control groups allow processes to be organized into groups and these groups' usage of resources can be limited and monitored
- There are also cgroup namespaces (see namespace notes)
chroot
- limits a process's view of the filesystem to a single subtree by changing its root directory
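
To make the capabilities item above more concrete, here is a minimal Python sketch that decodes a capability bitmask, such as the CapEff field in /proc/self/status, into capability names. The helper names (decode_caps, effective_caps) are made up for illustration, and the table lists only a subset of capabilities; the bit numbers are the ones defined in linux/capability.h.

```python
# Subset of the capability bit numbers from <linux/capability.h>.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask):
    """Return the names of the capability bits set in mask, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits missing from the table above are reported by number.
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff line (a hex bitmask) from a status file and decode it."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(int(line.split()[1], 16))
    return []
```

Calling effective_caps() from an unprivileged process normally returns an empty list, while a root process outside any sandbox typically shows every capability. A tenant process should end up close to the former.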

Container environment:

Docker - https://www.docker.com/resources/what-container
Kubernetes - https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
BusyBox - single small binary bundling common Unix utilities; often used as the base of minimal Docker images
Yocto - https://www.yoctoproject.org/docs/
debootstrap - https://manpages.debian.org/unstable/debootstrap/debootstrap.8.en.html

Filtering:

seccomp + BPF
- minimize exposed kernel surface
- when a process makes a system call, the filter has several options:
  - allow it to continue
  - force it to return an error
  - kill the thread
  - kill the process
LD Preload (LD_PRELOAD)
- loads your own shared library ahead of the libraries a program would normally use, so your versions of library functions run first and can change or add logic
- does not stop a program from making system calls directly, bypassing the library
LSM (Linux Security Modules)
- https://www.starlab.io/blog/a-brief-tour-of-linux-security-modules/
- mediates process access to important kernel objects and can deny it
- used to implement mandatory access control (MAC)
- a security module registers hooks that the LSM framework calls inside the kernel to check whether security requirements are met
- some existing modules:
  - SMACK
    - developed for embedded devices
  - TOMOYO
    - records test sessions to learn what acceptable behavior looks like, then uses that to create its MAC policies more easily
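
The LD Preload caveat noted above, that a program can still make system calls directly, can be seen in a small Python sketch. It asks the kernel for the process ID through the generic syscall() entry point instead of the getpid() library function, so a preloaded library that only wraps getpid() would never see the call. The function name raw_getpid and the architecture table are assumptions made for this illustration. A preload could wrap syscall() as well, but a statically linked program or inline assembly skips libc entirely, which is why a kernel-side filter such as seccomp is still needed.

```python
import ctypes
import os
import platform
import sys

# getpid syscall numbers per architecture, taken from the kernel syscall tables.
SYS_GETPID = {"x86_64": 39, "aarch64": 172, "armv7l": 20, "i686": 20}


def raw_getpid():
    """Ask the kernel for our PID without calling the getpid() wrapper.

    Returns None on platforms this sketch does not know about.
    """
    number = SYS_GETPID.get(platform.machine())
    if not sys.platform.startswith("linux") or number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    # syscall(2) takes the syscall number followed by its arguments (none here).
    return libc.syscall(number)


if __name__ == "__main__":
    print("wrapper:", os.getpid(), "direct:", raw_getpid())
```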

Goal:

Provide lightweight sandboxed isolation for tenant processes with minimal container orchestration overhead. We should be able to let tenants run their code (untrusted by us) without fear of malicious activity. Tenants shouldn't be able to interfere with other tenants' resources on the system, and the kernel attack surface available to them should be as small as possible. Tenants should have a very limited view of the system: it should seem to each tenant that it is alone on the system, with its own global resources. System calls should be monitored and filtered with seccomp and BPF. Many library functions will be interposed by Sunneed through LD_PRELOAD so that, before a function runs, Sunneed can check the tenant's privileges and available energy, since energy is one of the global resources we are trying to control. This interposition layer lets tenants program against the standard POSIX API while transparently being forced through our own Sunneed API. Operations such as making network connections and getting file descriptors will go through Sunneed, which uses IPC to pass the requested resources to tenant applications.

Decisions (so far):

The first decision was whether or not to use an existing container option such as Docker or Kubernetes. Both provide a lot of options for creating and running container images, but both also carry a great deal of orchestration overhead. That overhead is a problem for two reasons: we are dealing with constrained resources on an embedded system, and Sunneed, running in userspace, is already meant to be our means of tenant orchestration, which adding one of these container options would complicate.

I began this research by working with seccomp examples and starting the seccomp filter code. The main decision here is whether to blacklist or whitelist system calls. Whitelisting means allowing only the system calls on a specified list and rejecting everything else. Blacklisting means rejecting only the system calls we don't want the tenant to use and allowing everything else. I've decided so far to whitelist: it is the more comprehensive approach, because there could be malicious uses of system calls that we didn't predict and therefore didn't blacklist.
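
As a rough illustration of the whitelist approach, the Python sketch below builds a classic BPF program of the kind seccomp accepts and runs it through a toy interpreter to show which action each system call would get. The function names (build_whitelist, pack, evaluate) are invented for this example, the syscall numbers in the usage note are the x86_64 ones, and the default action of returning EPERM is just one of the options listed in the Filtering section. Loading the packed program into the kernel, via prctl(PR_SET_NO_NEW_PRIVS) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...), is not shown.

```python
import struct

# Classic BPF opcodes, as defined in <linux/bpf_common.h>.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

# Filter return values from <linux/seccomp.h>.
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000  # OR the errno into the low 16 bits
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_KILL_PROCESS = 0x80000000

AUDIT_ARCH_X86_64 = 0xC000003E
EPERM = 1

# Field offsets inside struct seccomp_data.
OFF_NR, OFF_ARCH = 0, 4


def build_whitelist(allowed, default=SECCOMP_RET_ERRNO | EPERM,
                    arch=AUDIT_ARCH_X86_64):
    """Return (code, jt, jf, k) tuples allowing only the given syscall numbers."""
    prog = [
        (BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        (BPF_JEQ_K, 1, 0, arch),  # expected arch: jump over the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
    ]
    for nr in allowed:
        prog.append((BPF_JEQ_K, 0, 1, nr))  # no match: jump over the allow
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, default))  # everything not listed
    return prog


def pack(prog):
    """Serialize to struct sock_filter layout: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def evaluate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Toy interpreter for the three opcodes used above."""
    data = {OFF_NR: nr, OFF_ARCH: arch}
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        else:  # BPF_RET_K
            return k
```

For example, with build_whitelist([0, 1, 60, 231]) (read, write, exit, exit_group on x86_64), evaluate returns SECCOMP_RET_ALLOW for write, while execve (59) falls through to the default action. A blacklist would invert this: list the rejected calls and make the final return an allow, which is exactly where unpredicted calls slip through.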

The next decision was whether to use chroot or namespaces to give a tenant an isolated view of the system. chroot was not designed as a security boundary, and there are a number of known ways to escape a chroot sandbox, especially for a process that still has root privileges. Namespaces provide a great deal more isolation of global resources, so I've decided to go in that direction. There is a namespace type for each of the major global resources in the system, and they can be used alone or together to provide various levels of isolation.

The mount_namespace allows a process to mount its own separate filesystem. To actually use a mount_namespace, you need to pivot the process's root directory (pivot_root) from the host root to the root of the new sandboxed filesystem. For this I needed to decide how to get that filesystem.
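
The pivot described above can be sketched as an ordered list of libc calls. This is only an outline under several assumptions: the helper names (pivot_plan, apply_plan) are hypothetical, the sequence follows the pivot_root(".", ".") pattern from the pivot_root(2) man page, the C library is assumed to export a pivot_root symbol, and the new root is assumed to contain a /proc directory. Applying the plan needs root (CAP_SYS_ADMIN) or a surrounding user namespace, so the plan is built separately from the code that executes it.

```python
import ctypes
import os

# Values from <sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000
MS_BIND, MS_REC, MS_PRIVATE = 0x1000, 0x4000, 0x40000
MNT_DETACH = 2


def pivot_plan(new_root):
    """Ordered libc calls for entering new_root; nothing is executed here."""
    root = os.fsencode(os.path.abspath(new_root))
    return [
        # Get a private copy of the mount table.
        ("unshare", (CLONE_NEWNS,)),
        # Stop mount events from propagating back to the host namespace.
        ("mount", (None, b"/", None, MS_REC | MS_PRIVATE, None)),
        # pivot_root needs new_root to be a mount point: bind it onto itself.
        ("mount", (root, root, None, MS_BIND | MS_REC, None)),
        ("chdir", (root,)),
        # Stack the old root on top of the new one at the same path...
        ("pivot_root", (b".", b".")),
        # ...then lazily unmount it so the host filesystem disappears.
        ("umount2", (b".", MNT_DETACH)),
        ("chdir", (b"/",)),
        # Mount a /proc that belongs to this root instead of the host's.
        ("mount", (b"proc", b"/proc", b"proc", 0, None)),
    ]


def apply_plan(plan):
    """Run each step through libc; raises OSError on the first failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    for name, args in plan:
        if getattr(libc, name)(*args) != 0:
            err = ctypes.get_errno()
            raise OSError(err, f"{name}: {os.strerror(err)}")
```

Keeping the plan as data makes it easy to print and review the exact sequence before running it as root in a throwaway child process.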

The first thing I tried was BusyBox, a minimal set of Unix utilities often used as the base for Docker images. This had problems because the BusyBox root doesn't include the glibc shared libraries. This means that a program requiring even the standard libraries wouldn't be able to run at first; you would have to manually copy every library you need into the BusyBox root.
Next, I tried just creating a directory to function as the tenant's root and soft linking everything in the host root directory into it. This caused problems when using the mount_namespace to mount the tenant's root. The directories within the tenant's root were only links back to the host, so once the tenant is in its new mount namespace with a new root, the host mounts those links point to are lost to the tenant. A second issue is that even if those sub-root directories were visible to the tenant, the new mount_namespace should get its own mounts, for example a unique /proc. If the tenant process mounts /proc through a path that was soft linked from the original root, it is using the /proc the host has already mounted, so the namespace isn't actually isolating itself.
Next, I researched the Yocto Project, an open source project that lets you create your own custom Linux-based system image. This can be perfect for tailoring an OS to a specific embedded device like a Raspberry Pi. It definitely has advantages and seems like a very strong option that I still haven't fully ruled out. The disadvantage is a very steep learning curve, although it is apparently worth it when that level of customization is needed. I'm just unsure whether we need it for our sandboxed environment: we just need a place for our tenants to run their code, and I believe a simple Debian filesystem would suffice.
Therefore, as of now, I am using debootstrap to install a Debian root fs. It lets you specify the system architecture, packages, mirror, and destination for the image to be installed, and it is very simple and effective. I can then set up a tenant process with the seccomp filter, place it into its own namespaces, and let it run isolated and sandboxed within this simple Debian fs. This fs is pivoted to the root of the mount namespace, so the tenant only sees the Debian system it is locked into.
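
A debootstrap invocation with those four parameters might be assembled like this. It is a sketch: the function names and the example values (suite, target path, package list) are placeholders, not the values used for Sunneed, and only options documented in the debootstrap man page (--arch, --variant, --include) appear. debootstrap has to run as root and downloads packages, and building for an architecture different from the host needs its two-stage --foreign mode, which this sketch does not handle.

```python
import subprocess


def debootstrap_argv(target, suite="stable", arch=None, include=(),
                     mirror=None, minimal=True):
    """Build the argument list: debootstrap [options] SUITE TARGET [MIRROR]."""
    argv = ["debootstrap"]
    if arch:
        argv.append(f"--arch={arch}")
    if minimal:
        # minbase installs only essential packages plus apt.
        argv.append("--variant=minbase")
    if include:
        argv.append("--include=" + ",".join(include))
    argv += [suite, target]
    if mirror:
        argv.append(mirror)
    return argv


def build_rootfs(**kwargs):
    """Run debootstrap with the given settings; raises if it fails."""
    subprocess.run(debootstrap_argv(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it; all values are placeholders.
    print(" ".join(debootstrap_argv(
        "/srv/tenant-root",
        suite="bookworm",
        arch="armhf",
        include=["libc6", "bash"],
        mirror="http://deb.debian.org/debian",
    )))
```

The resulting directory is what the tenant process would pivot into as its root.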

Need to look more into:

capabilities & cgroups
Yocto (at least a bit more research)
LSMs
defending against LD Preload workarounds
