Marcelo Magallon edited this page Jun 3, 2019 · 37 revisions

High level description

The code for starter is found in cmd/starter. It is a cgo program: high-level configuration management is implemented in Go and low-level system access is implemented in C.

An init function is used to make sure the Go runtime is configured to use a single goroutine and to pin the main function to a single thread.

The C portion of the code runs before the Go runtime by way of a constructor function.

Two environment variables affect the program's behavior:

  • SINGULARITY_MESSAGELEVEL: the log level
  • PIPE_EXEC_FD: the file descriptor used to pass the configuration (in JSON format) to starter
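
As an illustration of how these two variables might be consumed, here is a minimal Go sketch. The type starterEnv and the function readStarterEnv are hypothetical names, not taken from the starter source. The sketch assumes that SINGULARITY_MESSAGELEVEL holds an integer and that a missing or malformed level falls back to 0; neither assumption is confirmed by this page.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// starterEnv holds the two values read from the environment.
// The type and its field names are hypothetical.
type starterEnv struct {
	MessageLevel int // from SINGULARITY_MESSAGELEVEL
	ConfigFD     int // from PIPE_EXEC_FD
}

// readStarterEnv parses both variables using the supplied lookup
// function (os.Getenv in a real program, a stub in tests).
func readStarterEnv(getenv func(string) string) (starterEnv, error) {
	var env starterEnv

	// Assumption: the level is an integer, and an unset or malformed
	// value falls back to 0.
	if lvl, err := strconv.Atoi(getenv("SINGULARITY_MESSAGELEVEL")); err == nil {
		env.MessageLevel = lvl
	}

	// The descriptor is mandatory: without it there is no configuration.
	raw := getenv("PIPE_EXEC_FD")
	fd, err := strconv.Atoi(raw)
	if err != nil || fd < 0 {
		return env, fmt.Errorf("PIPE_EXEC_FD is not a valid file descriptor: %q", raw)
	}
	env.ConfigFD = fd
	return env, nil
}

func main() {
	env, err := readStarterEnv(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Printf("log level %d, configuration on fd %d\n", env.MessageLevel, env.ConfigFD)
}
```

Passing the lookup function as a parameter keeps the parsing logic testable without touching the real process environment.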

Flow

Set up

The configuration is placed in shared memory so that the child processes can access it.

The program determines whether its setuid bit is set by examining the auxiliary vector looking for the AT_SECURE attribute. The kernel sets this attribute to a non-zero value to indicate that the program should be treated securely, and this usually means the setuid bit was set.
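
The check can be sketched in Go as a scan over the raw auxiliary vector. This is only an illustration: it assumes a 64-bit little-endian layout (each entry is a pair of 64-bit words) and reads the vector from /proc/self/auxv, which is one way to obtain it without getauxval. Whether starter obtains the vector this way is not stated here. The names isSecure and auxvEntry are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	atNull   = 0  // AT_NULL terminates the vector
	atSecure = 23 // AT_SECURE, as defined in <elf.h>
)

// isSecure scans a raw auxiliary vector and reports whether AT_SECURE
// is present with a non-zero value.
// Assumption: 64-bit little-endian entries (key, value).
func isSecure(auxv []byte) bool {
	for off := 0; off+16 <= len(auxv); off += 16 {
		key := binary.LittleEndian.Uint64(auxv[off:])
		val := binary.LittleEndian.Uint64(auxv[off+8:])
		if key == atNull {
			break
		}
		if key == atSecure {
			return val != 0
		}
	}
	return false
}

// auxvEntry encodes one (key, value) pair; it is only used to build
// synthetic vectors for experiments and tests.
func auxvEntry(key, val uint64) []byte {
	b := make([]byte, 16)
	binary.LittleEndian.PutUint64(b, key)
	binary.LittleEndian.PutUint64(b[8:], val)
	return b
}

func main() {
	raw, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read auxiliary vector:", err)
		return
	}
	fmt.Println("AT_SECURE set:", isSecure(raw))
}
```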

Why isn't getauxval used? It's been available since glibc 2.16.

If the program is running as root or setuid, it tries to mount an overlay in order to get the kernel to load the overlay module.

Why not modprobe overlay? Probably to account for kernels where overlay is built in rather than a loadable module (a comment should be added to the code).

Privileges are dropped temporarily.

Environment is cleared.

Configuration is read and the configuration file descriptor is closed.

Each of stdin, stdout and stderr is pointed to /dev/null if it is closed. This is done because some programs do not work properly if file descriptors 0, 1 and 2 are closed at startup.

The list of open file descriptors is saved.

SIGCHLD is blocked.

The stage 1 thread is launched sharing open files and filesystem information with the main thread, in both directions. This is achieved by passing CLONE_FILES to the clone call, causing both processes to share the same file descriptor table. CLONE_FS is also passed to clone, causing both processes to share the same filesystem information, including the root of the filesystem, the current working directory and the umask.

Stage 1

Stage 1 is responsible for parsing the singularity configuration file, handling user input, reading capabilities and checking which namespaces are required.

If the binary is setuid, root privileges are restored and stage 1 is prepared.

Return to master

The master thread waits for stage 1 to be done.

The master thread checks the exit status of the stage 1 process. If it is non-zero, the master exits. If stage 1 was terminated by a signal, the master sends the same signal to itself.
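
The decision made by the master can be sketched as a small Go function that decodes a raw Linux wait status. The type action and the function decide are hypothetical names. Carrying the exit code of stage 1 into the exit action is an assumption, since this page only says that the master exits.

```go
package main

import "fmt"

// action describes what the master does after stage 1 ends.
// The type is hypothetical.
type action struct {
	Kind string // "continue", "exit", "signal" or "other"
	Code int    // exit code or signal number, depending on Kind
}

// decide decodes a raw Linux wait status, as returned by waitpid.
// The low 7 bits hold the terminating signal (0 for a normal exit);
// for a normal exit, bits 8-15 hold the exit code.
func decide(status int) action {
	termSig := status & 0x7f
	switch {
	case termSig == 0:
		if code := (status >> 8) & 0xff; code != 0 {
			// Assumption: the master propagates the exit code.
			return action{Kind: "exit", Code: code}
		}
		return action{Kind: "continue"}
	case termSig != 0x7f:
		// Stage 1 was killed by a signal: re-raise it in the master.
		return action{Kind: "signal", Code: termSig}
	}
	// 0x7f marks a stopped process, which is not covered by this page.
	return action{Kind: "other"}
}

func main() {
	for _, st := range []int{0, 1 << 8, 9} {
		fmt.Printf("status %#x -> %+v\n", st, decide(st))
	}
}
```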

Create a socket pair.

If the container to be started is an instance, fork:

Yield CPU

What is this and why is it necessary?

The new list of open file descriptors is captured

The two file descriptor lists are compared. Any new file descriptor is closed if it corresponds to a tty device, corresponds to an anonymous inode (obtained from calling epoll_create, inotify_init, eventfd, etc.), or cannot be resolved (the /proc/$pid/fd/$fd symlink is broken or cannot be read). The file descriptors corresponding to the socket pair are ignored. For all the other new file descriptors, the close-on-exec flag is set.
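
The classification rule can be sketched in Go as a pure function over the two lists. Everything here is illustrative: the names classify and fdAction are hypothetical, a descriptor is represented by the target of its /proc/$pid/fd/$fd symlink (empty when it cannot be resolved), and tty devices are recognized by a /dev/tty or /dev/pts/ path prefix, which is an assumption about how the check could be done rather than how starter does it. Descriptors that were already open before stage 1 are assumed to be left untouched.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type fdAction int

const (
	keep       fdAction = iota // pre-existing or socket pair: left alone
	closeFD                    // tty, anonymous inode or unresolvable
	setCloexec                 // any other new descriptor
)

// classify decides what to do with each open descriptor.
// before: descriptors that were open before stage 1 ran.
// after: descriptor -> symlink target ("" if it cannot be resolved).
// socketPair: the two ends of the master socket pair.
func classify(before map[int]bool, after map[int]string, socketPair [2]int) map[int]fdAction {
	actions := make(map[int]fdAction)
	for fd, target := range after {
		switch {
		case before[fd] || fd == socketPair[0] || fd == socketPair[1]:
			actions[fd] = keep
		case target == "",
			strings.HasPrefix(target, "anon_inode:"),
			strings.HasPrefix(target, "/dev/tty"), // assumption: tty detection by path
			strings.HasPrefix(target, "/dev/pts/"):
			actions[fd] = closeFD
		default:
			actions[fd] = setCloexec
		}
	}
	return actions
}

func main() {
	before := map[int]bool{0: true, 1: true, 2: true}
	after := map[int]string{
		0: "/dev/pts/0", 1: "/dev/pts/0", 2: "/dev/pts/0",
		3: "socket:[100]", 4: "socket:[101]",
		5: "anon_inode:[eventpoll]", 6: "", 7: "/tmp/image.sif",
	}
	actions := classify(before, after, [2]int{3, 4})

	fds := make([]int, 0, len(actions))
	for fd := range actions {
		fds = append(fds, fd)
	}
	sort.Ints(fds)
	names := map[fdAction]string{keep: "keep", closeFD: "close", setCloexec: "set close-on-exec"}
	for _, fd := range fds {
		fmt.Printf("fd %d: %s\n", fd, names[actions[fd]])
	}
}
```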

Why?

User namespace is initialized. This elevates privileges if any of these conditions are true:

Note that from here on the process might be operating with elevated privileges (see above).

In the same step, if a new user namespace is requested, a user namespace is not specified and shared mount is not requested, then CLONE_NEWUSER is added to fork flags.

Mount propagation is set up.

If the fork flags are exactly CLONE_NEWUSER, a file descriptor for event notification is set up.

If a join mount is not requested, the RPC socket pair is set up.

If the process is running suid, the filesystem ID is reset to the real ID of the calling process.

A pipe is created for synchronization.

PID namespace is set up. This adds CLONE_NEWPID to fork flags if a new PID namespace is requested.
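
The way the fork flags accumulate over these steps can be sketched in Go. This is a reading of the text above, not a transcription of the starter code: nsRequest and forkFlags are hypothetical names, and "a user namespace is not specified" is interpreted as "no existing user namespace to join was given", which is an assumption. The flag values are the ones from the Linux headers.

```go
package main

import "fmt"

// Values from <linux/sched.h>.
const (
	cloneNewUser = 0x10000000
	cloneNewPID  = 0x20000000
)

// nsRequest is a hypothetical summary of the relevant configuration.
type nsRequest struct {
	NewUserNS   bool // a new user namespace is requested
	JoinUserNS  bool // assumption: an existing user namespace is specified
	SharedMount bool // shared mount is requested
	NewPIDNS    bool // a new PID namespace is requested
}

// forkFlags follows the order of the steps above: the user namespace
// flag is decided first, then the event notification check is made,
// and only afterwards is the PID namespace flag added.
func forkFlags(r nsRequest) (flags uint, needEventFD bool) {
	if r.NewUserNS && !r.JoinUserNS && !r.SharedMount {
		flags |= cloneNewUser
	}
	// Checked before CLONE_NEWPID is added, as in the text.
	needEventFD = flags == cloneNewUser
	if r.NewPIDNS {
		flags |= cloneNewPID
	}
	return flags, needEventFD
}

func main() {
	flags, ev := forkFlags(nsRequest{NewUserNS: true, NewPIDNS: true})
	fmt.Printf("flags=%#x eventfd=%v\n", flags, ev)
}
```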

Stage 2

Stage 2 is started.

Set up process to be killed if parent dies.

Rendezvous with master on user namespace mappings and apply user namespace mappings.

Close one end of the master socket pair.

Initialize network namespace.

Initialize hostname (UTS) namespace.

Initialize IPC namespace.

Initialize cgroup namespace.

Initialize mount namespace.

Rendezvous with master process on sync pipe.

Master

If a new PID namespace is requested and a new mount namespace is requested, a new PID namespace is created.

If a new user namespace is requested, the master sets up the new user namespace mappings for the stage 2 process. The master then rendezvouses with the stage 2 process.

Terminal control is passed to stage 2 process.

Close one end of the master socket pair.

Rendezvous with stage 2 process on sync pipe.
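
The rendezvous on the sync pipe can be illustrated with a small Go sketch in which a goroutine stands in for the stage 2 process. In starter the two sides are separate processes, so this only shows the blocking-read pattern. The function name rendezvous and the single-byte message are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
)

// rendezvous blocks the caller (playing the master) until the other
// side (playing stage 2) has finished its setup and written one byte
// to the pipe. The byte returned by stage2Setup is sent as the message.
func rendezvous(stage2Setup func() byte) (byte, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, err
	}
	defer r.Close()

	go func() {
		// Closing the write end also wakes the reader if setup
		// returns without a successful write.
		defer w.Close()
		msg := stage2Setup() // e.g. namespace initialization
		w.Write([]byte{msg})
	}()

	buf := make([]byte, 1)
	if _, err := r.Read(buf); err != nil {
		return 0, fmt.Errorf("stage 2 went away before signalling: %w", err)
	}
	return buf[0], nil
}

func main() {
	msg, err := rendezvous(func() byte {
		fmt.Println("stage 2: namespaces initialized")
		return 1
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Println("master: stage 2 reported", msg)
}
```

A pipe works well for this kind of synchronization because a read blocks until data arrives and returns end-of-file if the writer disappears, so the waiting side never hangs on a dead peer.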

Stage preparation

Stage preparation is the same for all the stages; it only changes as a function of the current configuration.

TBD

Configuration

The configuration structure consists of:

  • capabilities: the set of permitted, effective, inheritable, bounding and ambient Linux capabilities
  • namespace: the network, mount, user, IPC, UTS, cgroup and PID namespace information
  • container
  • json: the entire configuration as a JSON object
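
To make the shape of this structure easier to picture, here is a hypothetical Go rendering of it. None of the type or field names below come from the starter source, and the field types (bitmasks for capabilities, paths for namespaces) are assumptions chosen for illustration. The container member is left opaque because this page does not describe it.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// capabilities mirrors the five Linux capability sets.
// Assumption: each set is stored as a 64-bit mask.
type capabilities struct {
	Permitted, Effective, Inheritable, Bounding, Ambient uint64
}

// namespaces holds per-namespace information.
// Assumption: a path to an existing namespace, empty if none.
type namespaces struct {
	Network, Mount, User, IPC, UTS, Cgroup, PID string
}

// starterConfig is a hypothetical Go view of the configuration.
type starterConfig struct {
	Capabilities capabilities
	Namespace    namespaces
	Container    interface{}     // not described on this page
	JSON         json.RawMessage // the entire configuration as a JSON object
}

// newConfig stores raw in the JSON member after checking that it
// really is a JSON object.
func newConfig(raw []byte) (starterConfig, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return starterConfig{}, fmt.Errorf("configuration is not a JSON object: %w", err)
	}
	if obj == nil {
		return starterConfig{}, errors.New("configuration is null")
	}
	return starterConfig{JSON: json.RawMessage(raw)}, nil
}

func main() {
	cfg, err := newConfig([]byte(`{"key":"value"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("stored %d bytes of JSON configuration\n", len(cfg.JSON))
}
```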

Implementation

Logging