{Singularity} can make use of various Linux kernel features to modify the security scope and context of running containers. Non-root users may be granted additional permissions using Linux capabilities. SELinux, AppArmor, and Seccomp can be used to restrict the operations that can be performed by a container.
In {Singularity}'s default configuration, without --oci, a container started by root receives all capabilities, while a container started by a non-root user receives no capabilities.
Additionally, {Singularity} provides support for granting and revoking Linux capabilities on a user or group basis. For example, let us suppose that an administrator has decided to grant a user (named pinger) capabilities to open raw sockets so that they can use ping in a container where the binary is controlled via capabilities. For information about how to manage capabilities as an admin, please refer to the capability admin docs.
Note
In {Singularity}'s default setuid and non-OCI mode, containers are only isolated in a mount namespace. A user namespace, which limits the scope of capabilities, is not used by default.
Therefore, it is extremely important to recognize that granting users Linux capabilities with the capability command group is usually equivalent to granting those users root-level access on the host system. Most, if not all, capabilities will allow users to "break out" of the container and become root on the host. This feature is targeted toward special use cases (like cloud-native architectures) where an admin/developer might want to limit the attack surface within a container that normally runs as root. This is not a good option in multi-tenant HPC environments where an admin wants to grant a user special privileges within a container. For that and similar use cases, the fakeroot feature is a better option.
To take advantage of this granted capability as a user, pinger must also request the capability when executing a container with the --add-caps flag like so:
$ singularity exec --add-caps CAP_NET_RAW library://sylabs/tests/ubuntu_ping:v1.0 ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=52 time=73.1 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 73.178/73.178/73.178/0.000 ms
If the admin decides that it is no longer necessary to allow the user pinger to open raw sockets within {Singularity} containers, they can revoke the appropriate Linux capability and pinger will not be able to add that capability to their containers anymore:
$ singularity exec --add-caps CAP_NET_RAW library://sylabs/tests/ubuntu_ping:v1.0 ping -c 1 8.8.8.8
WARNING: not authorized to add capability: CAP_NET_RAW
ping: socket: Operation not permitted
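To confirm whether a capability actually reached a process, one option is to decode the hexadecimal masks that the kernel reports in /proc/<pid>/status (the CapEff, CapPrm, and CapBnd fields). The following is a minimal sketch, not part of {Singularity}: has_cap is a hypothetical helper, and the bit number 13 for CAP_NET_RAW is taken from the kernel's linux/capability.h rather than from this guide.

```shell
# has_cap MASK BIT: succeed if bit number BIT is set in the hex MASK.
# MASK is a value such as the CapEff field of /proc/<pid>/status.
# (Hypothetical helper for illustration; not a Singularity command.)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_RAW_BIT=13   # assumption: value from linux/capability.h

# A mask with only bit 13 set contains CAP_NET_RAW.
if has_cap 0000000000002000 "$CAP_NET_RAW_BIT"; then
    echo "CAP_NET_RAW present"
fi

# An empty mask, as for a container that received no capabilities, does not.
if ! has_cap 0000000000000000 "$CAP_NET_RAW_BIT"; then
    echo "CAP_NET_RAW absent"
fi
```

Inside a container, the live value could be read with awk '/^CapEff:/ {print $2}' /proc/self/status and passed as the first argument.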
Another scenario, atypical of shared resource environments but useful in cloud-native architectures, is dropping capabilities when spawning containers as the root user to help minimize attack surfaces. With a default installation of {Singularity}, containers created by the root user will maintain all capabilities. This behavior is configurable if desired. Check out the capability configuration and root default capabilities sections of the admin docs for more information.
Assuming the root user will execute containers with the CAP_NET_RAW capability by default, running the same container that pinger ran above works without the need to grant capabilities:
# singularity exec library://sylabs/tests/ubuntu_ping:v1.0 ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=52 time=59.6 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 59.673/59.673/59.673/0.000 ms
Now we can manually drop the CAP_NET_RAW capability like so:
# singularity exec --drop-caps CAP_NET_RAW library://sylabs/tests/ubuntu_ping:v1.0 ping -c 1 8.8.8.8
ping: socket: Operation not permitted
The container no longer has the ability to open raw sockets, causing the ping command to fail.
The --add-caps and --drop-caps options will accept the all keyword. Of course, appropriate caution should be exercised when using this keyword.
When containers are run in OCI-mode by a non-root user, initialization is always performed inside a user namespace. The capabilities granted to a container are specific to this user namespace. For example, CAP_SYS_ADMIN granted to an OCI-mode container does not give the user the ability to mount a filesystem outside of the container's user namespace.
Because of this isolation of capabilities, users can add and drop capabilities using --add-caps and --drop-caps, without the need for the administrator to have granted permission to do so with the singularity capability command.
OCI-mode containers do not inherit the user's own capabilities, but instead run with a default set of capabilities that matches other OCI runtimes:
- CAP_NET_RAW
- CAP_NET_BIND_SERVICE
- CAP_AUDIT_READ
- CAP_AUDIT_WRITE
- CAP_DAC_OVERRIDE
- CAP_SETFCAP
- CAP_SETPCAP
- CAP_SETGID
- CAP_SETUID
- CAP_MKNOD
- CAP_CHOWN
- CAP_FOWNER
- CAP_FSETID
- CAP_KILL
- CAP_SYS_CHROOT
When the container is entered as the root user (e.g. with --fakeroot), these default capabilities are added to the effective, permitted, and bounding sets.
When the container is entered as a non-root user, these default capabilities are added to the bounding set.
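As a rough cross-check, the default set listed above can be turned into the hexadecimal mask format that the kernel reports in /proc/<pid>/status. The sketch below assumes the capability bit numbers from the kernel's linux/capability.h (they are not given in this guide), and the resulting value is only what one would expect to see in CapBnd if exactly this set is applied.

```shell
# Bit numbers (assumed from linux/capability.h) for the default set above:
# CHOWN=0 DAC_OVERRIDE=1 FOWNER=3 FSETID=4 KILL=5 SETGID=6 SETUID=7
# SETPCAP=8 NET_BIND_SERVICE=10 NET_RAW=13 SYS_CHROOT=18 MKNOD=27
# AUDIT_WRITE=29 SETFCAP=31 AUDIT_READ=37
default_bits="0 1 3 4 5 6 7 8 10 13 18 27 29 31 37"

# OR the individual bits together into a single mask.
mask=0
for bit in $default_bits; do
    mask=$(( mask | (1 << bit) ))
done

# Print in the 16-digit hex form used by /proc/<pid>/status.
printf '%016x\n' "$mask"   # prints 00000020a80425fb
```

Comparing this value with the output of grep CapBnd /proc/self/status inside an OCI-mode container is one way to see whether capabilities were added or dropped relative to the default.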
When starting a container with the action commands shell, exec, and run, various flags allow fine-grained control of security.
In the default non-OCI-mode, --add-caps will grant specified Linux capabilities (e.g. CAP_NET_RAW) to a container, provided that those capabilities have been granted to the user by an administrator using the capability add command. This option will also accept the case-insensitive keyword all to add every capability granted by the administrator.
In OCI-mode, --add-caps will grant specified Linux capabilities (e.g. CAP_NET_RAW) to the container. Because the container runs in a user namespace, the capabilities are not effective on the host and do not have to be granted by the administrator. The keyword all will grant all available capabilities to the container.
In the default non-OCI-mode, the root user has a full set of capabilities when they enter the container. You may choose to drop specific capabilities when you initiate a container as root to enhance security.
For instance, to drop the ability for the root user to open a raw socket inside the container:
$ sudo singularity exec --drop-caps CAP_NET_RAW library://centos ping -c 1 8.8.8.8
ping: socket: Operation not permitted
In OCI-mode, any user can use --drop-caps to run a container with fewer capabilities than the default OCI capability set.
The --drop-caps option will also accept the case-insensitive keyword all to drop all capabilities when entering the container.
The SetUID bit allows a program to be executed as the user that owns the binary. The most well-known SetUID binaries are owned by root and allow a user to execute a command with elevated privileges. But other SetUID binaries may allow a user to execute a command as a service account.
By default, SetUID is disallowed within {Singularity} containers as a security precaution, by mounting container filesystems with the nosuid option.
In the default non-OCI-mode, the root user can override this precaution and allow SetUID binaries to behave as expected within a {Singularity} container with the --allow-setuid option like so:
$ sudo singularity shell --allow-setuid some_container.sif
In OCI-mode, any user can permit SetUID binaries with the --allow-setuid option. Because an OCI-mode container is always run in a user namespace, SetUID will change to UIDs inside a user's permitted subuid/subgid mapping. This does not allow access to arbitrary UIDs on the host system.
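Whether SetUID bits are honored on a given filesystem depends on its mount options, so one way to observe the effect of --allow-setuid is to look for nosuid in the options that the kernel lists in /proc/mounts. The helper below is a hypothetical illustration, not a {Singularity} command, and the option strings are made-up samples.

```shell
# mount_is_nosuid OPTIONS: succeed if the comma-separated mount OPTIONS
# (the 4th field of a /proc/mounts line) include nosuid.
# (Hypothetical helper for illustration; not a Singularity command.)
mount_is_nosuid() {
    case ",$1," in
        *,nosuid,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Sample option strings, shaped like the 4th field of /proc/mounts.
for opts in "ro,nosuid,nodev,relatime" "rw,relatime"; do
    if mount_is_nosuid "$opts"; then
        echo "$opts: SetUID bits ignored"
    else
        echo "$opts: SetUID bits honored"
    fi
done
```

Inside a container, the options for a mount point can be extracted with a command such as awk '$2 == "/" {print $4}' /proc/mounts, which may print more than one line if the path is mounted several times.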
In the default non-OCI-mode, an admin can set a different set of default capabilities, or reduce the default capabilities to zero, for the root user by setting the root default capabilities parameter in the singularity.conf file to file or no respectively. If this change is in effect, the root user can override the singularity.conf file and enter the container with full capabilities using the --keep-privs option.
$ sudo singularity exec --keep-privs library://centos ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=18.8 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 18.838/18.838/18.838/0.000 ms
In OCI-mode, the --keep-privs option can be used by any user. In this mode, --keep-privs will cause the container to run inheriting the current effective capabilities rather than using the OCI default capability set. When entering the container as a non-root user, the capabilities are inherited only into the bounding set.
In the default non-OCI-mode, the --no-privs option allows the root user to run a container with all capabilities dropped, and sets the no_new_privs bit that will prevent the container process from gaining any further privilege.
In OCI-mode, the --no-privs option can be used by any user to run a container with all capabilities dropped, and to set the no_new_privs bit that will prevent the container process from gaining any further privilege.
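The no_new_privs bit is visible from inside a process: on Linux 4.10 and later the kernel exposes it as the NoNewPrivs field of /proc/<pid>/status. The following sketch, which uses a hypothetical helper named nnp_value and sample input rather than a real container, shows how that field might be extracted.

```shell
# nnp_value: read status text on stdin and print the NoNewPrivs field
# (1 when no_new_privs is set, 0 otherwise).
# (Hypothetical helper for illustration; not a Singularity command.)
nnp_value() {
    awk '/^NoNewPrivs:/ { print $2 }'
}

# Sample input shaped like /proc/<pid>/status.
printf 'Name:\tsh\nNoNewPrivs:\t1\nSeccomp:\t0\n' | nnp_value   # prints 1
```

In a container started with --no-privs, running nnp_value < /proc/self/status would be expected to print 1.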
The --security flag, currently supported in non-OCI-mode only, allows the root user to leverage security modules such as SELinux, AppArmor, and seccomp within your {Singularity} container. It is also possible to change the UID and GID of the user within the container at runtime.
For instance:
$ sudo whoami
root
$ sudo singularity exec --security uid:1000 my_container.sif whoami
david
To use seccomp to blacklist a system call, follow this procedure. (It is actually preferable from a security standpoint to whitelist system calls, but this will suffice for a simple example.) Note that this example was run on Ubuntu and that {Singularity} was installed with the libseccomp-dev and pkg-config packages as dependencies.
First, write a configuration file. An example configuration file is installed with {Singularity}, normally at /usr/local/etc/singularity/seccomp-profiles/default.json. For this example, we will use a much simpler configuration file to blacklist the mkdir system call.
{
    "defaultAction": "SCMP_ACT_ALLOW",
    "archMap": [
        {
            "architecture": "SCMP_ARCH_X86_64",
            "subArchitectures": [
                "SCMP_ARCH_X86",
                "SCMP_ARCH_X32"
            ]
        }
    ],
    "syscalls": [
        {
            "names": [
                "mkdir"
            ],
            "action": "SCMP_ACT_KILL",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        }
    ]
}
We'll save the file at /home/david/no_mkdir.json. Then we can invoke the container like so:
$ sudo singularity shell --security seccomp:/home/david/no_mkdir.json my_container.sif
Singularity> mkdir /tmp/foo
Bad system call (core dumped)
Note that attempting to use the blacklisted mkdir system call killed the process and resulted in a core dump.
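If killing the process is too drastic, a gentler variant is to have the filtered system call return an error instead. The sketch below writes such a profile from the shell. It assumes that the SCMP_ACT_ERRNO action is accepted in this position, as it is in other OCI-style seccomp profiles, and the output path is a hypothetical choice for illustration.

```shell
# Write a variant of the profile that makes mkdir fail with an error
# instead of killing the calling process.
# Assumption: SCMP_ACT_ERRNO is accepted as a per-syscall action.
profile="${TMPDIR:-/tmp}/no_mkdir_errno.json"   # hypothetical path

cat > "$profile" <<'EOF'
{
    "defaultAction": "SCMP_ACT_ALLOW",
    "archMap": [
        {
            "architecture": "SCMP_ARCH_X86_64",
            "subArchitectures": [
                "SCMP_ARCH_X86",
                "SCMP_ARCH_X32"
            ]
        }
    ],
    "syscalls": [
        {
            "names": [
                "mkdir"
            ],
            "action": "SCMP_ACT_ERRNO",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        }
    ]
}
EOF

echo "wrote $profile"
```

The resulting file would be passed to the container in the same way, with --security seccomp: followed by the path held in $profile.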
The full list of arguments accepted by the --security option is as follows:
--security="seccomp:/usr/local/etc/singularity/seccomp-profiles/default.json"
--security="apparmor:/usr/bin/man"
--security="selinux:context"
--security="uid:1000"
--security="gid:1000"
--security="gid:1000:1:0" (multiple gids, first is always the primary group)
Beginning in {Singularity} 3.4.0, it is possible to build and run encrypted containers. The containers are decrypted at runtime entirely in kernel space, meaning that no intermediate decrypted data is ever present on disk. See encrypted containers for more details.