Linux sandboxing

Linux 3.12 or later is assumed, along with the following configuration options:

CONFIG_USER_NS=y
CONFIG_NAMESPACES=y

Namespaces

Namespaces are an isolation hierarchy for kernel resources. A full set of fresh namespaces is comparable to a virtual machine that shares its kernel with the host. A sandbox will likely want full isolation:

unshare(CLONE_NEWUSER|CLONE_NEWIPC|CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWUTS|CLONE_NEWNET);
spawn(sandboxed_process)
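A minimal Python sketch of the unshare call above, reaching the libc wrapper through ctypes. The flag values are those from <linux/sched.h>; the names ALL_NAMESPACES and unshare_all are made up for this illustration and are not part of any existing API.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

# Hypothetical name: the full set requested in the call above.
ALL_NAMESPACES = (CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNS |
                  CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNET)


def unshare_all():
    """Move the calling process into fresh namespaces, or raise OSError."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ctypes.c_int(ALL_NAMESPACES)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

An EPERM or EINVAL from this call typically means the kernel was built without CONFIG_USER_NS or the distribution restricts unprivileged user namespaces.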

Essential namespaces

CLONE_NEWNS

A mount namespace is the file hierarchy available to a process, consisting of the tree of mounts with ownership over their submounts. A mount and the owned submounts can be marked shared, private, slave or unbindable.

shared: mount and unmount events propagate to every other mount in the same peer group, including peers in other namespaces
private: changes do not propagate
slave: changes propagate from the master, but not vice-versa
unbindable: private, and cannot be cloned through a bind operation
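The four propagation types map directly onto mount(2) flags. The sketch below shows that mapping, with flag values taken from <sys/mount.h>; the helper names propagation_flags and set_propagation are hypothetical, and the call itself needs CAP_SYS_ADMIN in the namespace that owns the mount.

```python
import ctypes
import os

# Flag values from <sys/mount.h>.
MS_REC = 1 << 14
MS_UNBINDABLE = 1 << 17
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

PROPAGATION_FLAGS = {
    "shared": MS_SHARED,
    "private": MS_PRIVATE,
    "slave": MS_SLAVE,
    "unbindable": MS_UNBINDABLE,
}


def propagation_flags(kind, recursive=True):
    """Return the mount(2) flags that select one propagation type."""
    flags = PROPAGATION_FLAGS[kind]
    return flags | MS_REC if recursive else flags


def set_propagation(path, kind, recursive=True):
    """Change the propagation type of the mount at `path`."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Source, filesystem type and data are ignored for a propagation change.
    rc = libc.mount(None, os.fsencode(path), None,
                    ctypes.c_ulong(propagation_flags(kind, recursive)), None)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
```

For example, set_propagation("/", "private") would mark the whole tree private, so that later mounts inside the sandbox stay invisible to the host.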

A fully private mount namespace works well for an application sandbox. It allows for having a hidden lightweight read-only directory to chroot into with only the necessary devices (/dev/urandom) and mounts (/proc, and maybe a tmpfs).

Obtaining isolation in a mount namespace (root is a directory to chroot into, with proc sub-directory):

// avoid propagating mounts to or from the real root
if mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL) < 0 {
    fail!("mount /")
}

// turn directory into a bind mount
if mount(root, root, "bind", MS_BIND|MS_REC, NULL) < 0 {
    fail!("bind mount")
}

// re-mount as read-only (only the top-level mount changes: MS_REC is not applied on a remount)
if mount(root, root, "bind", MS_BIND|MS_REMOUNT|MS_RDONLY|MS_REC, NULL) < 0 {
    fail!("remount bind mount")
}

if chroot(root) < 0 {
    fail!("chroot")
}

if chdir("/") < 0 {
    fail!("chdir")
}

if mount(NULL, "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL) < 0 {
    fail!("mount /proc")
}

CLONE_NEWNET

An isolated network namespace, with only a loopback device by default. Virtual network devices can be given to the namespace, but that functionality should be unnecessary for Servo.

CLONE_NEWPID

An isolated process namespace, where the initial process is considered init and has PID 1. A fresh mount of /proc is required for it to reflect the new namespace. Since a sandbox will usually involve a chroot, that's a given. Note that unshare does not move the calling process into the new PID namespace, unlike the other namespaces: the first child forked afterwards becomes PID 1.
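Because the new PID namespace only takes effect for children, the sandboxed code has to run in a forked process. A rough sketch of that fork-and-wait step follows; run_in_child is a hypothetical helper name, and child_main is assumed to return an integer exit code.

```python
import os


def run_in_child(child_main):
    """Fork, run child_main() in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: if the parent called unshare(CLONE_NEWPID) first,
        # os.getpid() reports 1 in here.
        code = 1
        try:
            code = int(child_main())
        finally:
            os._exit(code)  # never fall back into the parent's code path
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # killed by a signal
```

The parent keeps its original PID throughout. Once the child acting as init exits, the kernel kills every other process left in that PID namespace, which makes cleanup straightforward.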

CLONE_NEWUSER

An isolated user/group namespace, where UID/GID values do not correspond to values outside of the namespace even when equal. This is the most essential, as it's a requirement for using namespaces without CAP_SYS_ADMIN.
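The correspondence between IDs inside and outside the namespace is set by writing to /proc/<pid>/uid_map and gid_map. The sketch below is an illustration under assumptions rather than a complete recipe: the function names are hypothetical, and it maps a single outside user to pseudo-root inside.

```python
def id_map_line(inside, outside, count=1):
    """Format one line of /proc/<pid>/uid_map or gid_map."""
    return "%d %d %d\n" % (inside, outside, count)


def map_current_user_to_root(outer_uid, outer_gid):
    """Make the caller pseudo-root in a user namespace it has just created.

    outer_uid and outer_gid must be the IDs the process had before unshare:
    until a map is written, getuid() reports the overflow ID (usually 65534).
    """
    # Linux 3.19 and later require setgroups to be disabled before an
    # unprivileged process may write gid_map; older kernels lack the file.
    try:
        with open("/proc/self/setgroups", "w") as f:
            f.write("deny")
    except FileNotFoundError:
        pass
    with open("/proc/self/uid_map", "w") as f:
        f.write(id_map_line(0, outer_uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(id_map_line(0, outer_gid))
```

Each map can only be written once, so the mapping has to be decided before the sandboxed process starts.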

To reduce the kernel attack surface, it will obviously be a good idea to drop from the pseudo-root user immediately. This will mean having the bare essentials of a user database in /etc for the chroot.

User namespaces were automatically disabled if XFS was enabled before Linux 3.12, so that is essentially going to be the soft minimum requirement. However, distributions still need to enable the CONFIG_USER_NS switch, and they may not want to do it right away due to security risks.

It seems Fedora is starting off with it enabled, but with a patch that still requires CAP_SYS_ADMIN to create a user namespace. User namespaces were primarily added to allow for unprivileged containers, so the restriction should go away eventually.

Unimportant namespaces

CLONE_NEWUTS

Essentially just an isolated domain name and host name. There's no harm in hiding this information!

CLONE_NEWIPC

An isolated view of System V IPC and POSIX message queues. Again, not very interesting, but obviously a good idea since there's nothing to lose.

Seccomp

seccomp-bpf is essentially iptables for system calls. It allows building a whitelist of allowed system calls, and adding arbitrary integer comparison checks for each of the parameters. For Servo, this will primarily be useful for reducing the kernel attack surface.
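To make the whitelist idea concrete, here is a sketch that assembles the classic BPF program for such a filter as raw bytes. It assumes x86-64 (both the architecture constant and any syscall numbers passed in), build_whitelist is a hypothetical helper name, and it only builds the program: installing it would additionally need prctl(PR_SET_NO_NEW_PRIVS) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...).

```python
import struct

# Opcode and return values from <linux/filter.h> and <linux/seccomp.h>.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E

# Offsets into struct seccomp_data.
OFF_NR = 0
OFF_ARCH = 4


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_whitelist(syscall_numbers, arch=AUDIT_ARCH_X86_64):
    """Return a BPF program allowing only the listed syscalls."""
    n = len(syscall_numbers)
    if n > 255:
        raise ValueError("jump offsets are 8 bits; split the list")
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        insn(BPF_JEQ_K, 1, 0, arch),              # right arch: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL),
        insn(BPF_LD_W_ABS, 0, 0, OFF_NR),
    ]
    for i, nr in enumerate(syscall_numbers):
        # On a match, jump over the remaining compares and the final KILL.
        prog.append(insn(BPF_JEQ_K, n - i, 0, nr))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL))
    prog.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return b"".join(prog)
```

On x86-64, build_whitelist([0, 1, 60]) would permit only read, write and exit. Checks on parameters work the same way, loading from the args array that starts at offset 16 of struct seccomp_data.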

The value of seccomp for isolation approaches zero as new system calls are required, because the parameters cannot usually be restricted much. There are at least a few system calls with information leaks, like the ability to read the kernel log when kernel.dmesg_restrict is unset.

https://github.com/thestinger/rust-seccomp

The runtime alone needs a large number of system calls, so namespaces are going to be much more valuable as a starting point.

Other details

use setsid to make a fresh session
use setresgid/setresuid for dropping pseudo-root
make sure inherited file descriptors aren't breaking the sandbox
make sure to wipe out the environment
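The checklist above might look roughly like this when run in a freshly forked child just before exec. All function names are hypothetical, close_inherited_fds assumes /proc is mounted, and the allowed environment variables are only an example.

```python
import os


def scrub_environment(env, allowed=("PATH",)):
    """Return a copy of `env` keeping only explicitly allowed variables."""
    return {k: v for k, v in env.items() if k in allowed}


def close_inherited_fds(keep=(0, 1, 2)):
    """Close every open descriptor except those in `keep`."""
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd not in keep:
            try:
                os.close(fd)
            except OSError:
                pass  # e.g. the descriptor listdir itself was using


def harden(uid, gid):
    """Apply the steps above in the calling process."""
    os.setsid()                    # fresh session, no controlling terminal
    close_inherited_fds()
    os.setresgid(gid, gid, gid)    # group first: it fails once the UID is dropped
    os.setresuid(uid, uid, uid)
    os.environ.clear()
```

setsid fails if the caller is already a process group leader, which is one more reason to do this in the forked child rather than in the parent.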

Cgroups

Control groups are very interesting for restricting resource usage, but they are a system administrator feature at this point and stopping a denial of service via remote code execution in the sandbox is a low priority. Rather than using control groups, resource limits can be set for the process and seccomp can be used to prevent spawning new processes or changing the limits.
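As a sketch of the resource-limit alternative, the standard resource module can set the limits before the sandboxed code runs. The specific values below are placeholders chosen for illustration, and apply_limits and SANDBOX_LIMITS are hypothetical names.

```python
import resource


def apply_limits(limits):
    """Set each (soft, hard) pair; a lowered hard limit cannot be raised
    again without privilege."""
    for res, pair in limits.items():
        resource.setrlimit(res, pair)


# Hypothetical values for a sandboxed child; tune to the real workload.
SANDBOX_LIMITS = {
    resource.RLIMIT_NPROC: (0, 0),             # no further processes
    resource.RLIMIT_NOFILE: (64, 64),          # open file descriptors
    resource.RLIMIT_AS: (1 << 30, 1 << 30),    # 1 GiB of address space
    resource.RLIMIT_CORE: (0, 0),              # no core dumps
}
```

Setting soft and hard limits to the same value leaves the process no room to raise them, and a seccomp filter that leaves out setrlimit and prlimit64 removes the system calls entirely. RLIMIT_NPROC is not enforced for privileged processes, so it only takes effect after dropping pseudo-root.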

SELinux/AppArmor

Not portable, and also essentially a system administrator feature. There's not much that can be done here anyway because users will expect a browser as a whole to have access to the filesystem.
