covidclinical/Phase2.1DockerAnalysis

4CE Phase 2.1 Computing Environment

This document describes the intended use of the Docker image that has been created by and for the 4CE consortium to support the execution of analytic code at individual participating sites.

Table of Contents

  1. Prerequisites
  2. Starting the Container
  3. Connecting to the Container
  4. Offline Usage
  5. Other Information

1. Prerequisites

Docker

In order to run this container you need to have the Docker runtime installed.

See: https://docs.docker.com/get-docker/

Host Data Access

While not a requirement to run the container and connect to it, per se, this image is intended to be run on a host that has file system access to the data generated for the 4CE Phase 2.1 projects. Without such access, none of the quality control or analysis packages can be run. Please make sure that the host where you will run the container has access to the files (e.g., they are saved on an accessible local or network storage location) generated by the 4CE Phase 2.1 extraction routine.

2. Starting the Container

Once you have the Docker runtime installed, you can pull and run the container image from the DockerHub Registry with a small set of simple command line instructions.

To pull the image from DockerHub and run the container:

docker run --rm --name 4ce -d -v /SOME_LOCAL_PATH:/4ceData \
                            -p 8787:8787 \
                            -p 2200:22 \
                            -e CONTAINER_USER_USERNAME=REPLACE_ME_USERNAME \
                            -e CONTAINER_USER_PASSWORD=REPLACE_ME_PASSWORD \
                            dbmi/4ce-analysis:version-2.4.0

Previous versions of this documentation suggested running the latest tag of the image. For reproducibility, we are now asking all sites to run an explicitly named tag and keep a record of which version they are using for their analyses.

To remove any existing versions of the latest tag:

docker image rm dbmi/4ce-analysis:latest

Parameters

Bind Mount Volume /SOME_LOCAL_PATH

/SOME_LOCAL_PATH should be replaced by the path to the directory on the host where the site's data files (generated by the 4CE Phase 2.1 extraction routine) are available. In the running container, this directory on the host will be mounted at /4ceData. The user who runs the container from the command line needs effective read + write + execute permissions on the host data directory pointed to by /SOME_LOCAL_PATH.
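The permission requirement above can be checked before starting the container. The following is a minimal sketch, not part of the 4CE tooling: check_data_dir is a hypothetical helper name, and it only tests the access of the user who runs it on the host.

```shell
# Hypothetical helper: confirm the current user has read, write, and
# execute access to the host directory that will be mounted at /4ceData.
check_data_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir"
        return 1
    fi
    if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "ok: $dir"
        return 0
    fi
    echo "insufficient permissions: $dir"
    return 1
}
```

For example, check_data_dir /SOME_LOCAL_PATH should report "ok" before you pass the same path to the -v argument.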

Environment Variables CONTAINER_USER_USERNAME and CONTAINER_USER_PASSWORD

These are the username and password that will get created on the container, and will be used to connect to it via ssh, or to log into the R Studio Server Web UI.

Port Mapping

The -p flag to Docker maps a TCP port in the container to a TCP port on the Docker host. More information is available in the Docker documentation. For example, in the above invocation, we are mapping TCP 8787 in the container to TCP 8787 on the Docker host, and TCP 22 (ssh) in the container to TCP 2200 on the Docker host. This allows the user to connect to the container by ssh'ing to localhost on port 2200, or by pointing a web browser at port 8787 on localhost to connect to R Studio Server. More information is available in Connecting to the Container.

Image Name and Tag (Version)

The final line of the Docker command above: dbmi/4ce-analysis:version-2.4.0 specifies the image name and tag that will be run. The image name is dbmi/4ce-analysis and the tag is the string following the colon, e.g. version-2.4.0. We will use tags to track the version of the container that each site is running locally. We will attempt to keep this documentation up-to-date with instructions for running the latest release version, but you can always refer to the container's registry page https://hub.docker.com/repository/docker/dbmi/4ce-analysis/tags?page=1 for the complete list of available container versions.
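One way to keep the requested record of which version a site is running is to log the image name and tag each time the container is started. The sketch below is an illustration only: record_image_version and the log file layout are assumptions, and the parsing assumes an image reference without a registry port (true for dbmi/4ce-analysis).

```shell
# Hypothetical helper: append a UTC timestamp, the image name, and the tag
# to a log file, refusing references that are untagged or use "latest".
record_image_version() {
    ref="$1"
    log="$2"
    case "$ref" in
        *:*) ;;
        *) echo "no tag given: $ref" >&2; return 1 ;;
    esac
    name="${ref%:*}"   # text before the last colon
    tag="${ref##*:}"   # text after the last colon
    if [ "$tag" = "latest" ]; then
        echo "refusing floating tag: $ref" >&2
        return 1
    fi
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "$tag" >> "$log"
}
```

For example, record_image_version dbmi/4ce-analysis:version-2.4.0 ./4ce_versions.log appends one tab-separated line to the log (the file name here is arbitrary).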

3. Connecting to the Container

Now that you have the container running, it's time to connect to it and run R. When the container is run with the command described in Starting the Container, both an SSH server and an R Studio Server instance are started inside the container. You have three options for how you will interact with an R session running in the container. You can

  1. Use a secure shell (SSH) client to connect to the container and run R on the command line
  2. Use a web browser to connect to R Studio Server running in the container
  3. Run the container interactively and run R on the command line directly, without an SSH client

These options, along with slight variations, are described in the following sections.

1. Connecting via SSH

If you are connecting to the container via ssh, you will need an SSH client. Linux and macOS typically have a command line ssh client installed out of the box. For Windows systems you will need to download an SSH client such as PuTTY (https://www.putty.org/).

To connect via ssh you'll use the following command, which assumes the default username (dockeruser). If you set CONTAINER_USER_USERNAME when starting the container, use that username instead:

ssh dockeruser@HOST_ADDRESS -p 2200

where HOST_ADDRESS is the IP address of the Docker host. Recall that we are mapping the ports that the SSH server and R Studio Server are using from the container to the Docker host. If you are running the ssh command on the same host where Docker is running, you can substitute localhost for HOST_ADDRESS. If you are ssh'ing to the container from another host (e.g., running Docker on a server, connecting from a workstation/laptop), then you will need to substitute the IP address of the Docker host for HOST_ADDRESS. In this latter case, you will also need to ensure that any relevant firewalls allow connections on TCP port 2200 from the "workstation" to the "server". A full discussion of the principles of firewall configuration is beyond the scope of this document. For further assistance, you may need to consult your local information technology team.

Once you have successfully established an SSH connection to the container, you can run an R command-line session.

If you have previously run another version of the container on the same host, you may receive a message like the following when SSH attempts to connect:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The SSH client keeps a record of host identifiers associated with IP addresses. As the version of the container changes, the host identifier may change as well, causing the SSH client to raise the warning to alert the user to the fact that they are connecting to a different host than they had previously at this network address.

You will need to either manually remove the old host association with the IP address (e.g., by deleting the entry from the client's ~/.ssh/known_hosts file), or tell the SSH client to ignore known hosts by adding -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null to the argument list for the ssh command above. The full command would therefore look like:

ssh dockeruser@HOST_ADDRESS -p 2200 -Y -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null
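As an alternative to editing ~/.ssh/known_hosts by hand, the stale entry can be removed with ssh-keygen -R. OpenSSH records hosts reached on a non-default port as [host]:port, so the entry for the container is keyed by that form rather than by the bare address. The helper names below are hypothetical; this is a sketch, assuming the 2200 port mapping used in this document.

```shell
# Print the name under which OpenSSH stores a host key in known_hosts:
# plain "host" for port 22, "[host]:port" for any other port.
known_hosts_name() {
    host="$1"
    port="${2:-22}"
    if [ "$port" = "22" ]; then
        printf '%s\n' "$host"
    else
        printf '[%s]:%s\n' "$host" "$port"
    fi
}

# Hypothetical helper: remove the stale key for the container's SSH
# endpoint from the current user's known_hosts file.
forget_container_key() {
    ssh-keygen -R "$(known_hosts_name "$1" "${2:-2200}")"
}
```

For example, forget_container_key localhost removes the entry for [localhost]:2200, after which the next ssh connection will prompt you to accept the new host key.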

2. Connecting to R Studio in a Web Browser

Some users may prefer to use the R Studio Server IDE in a web browser instead of running R on the command line.

If you are running the web browser on the same host where the container is running, you should be able to navigate the browser to http://localhost:8787 to access R Studio Server. You will be prompted to enter the username and password that were passed to the docker run command.

If you are running the web browser on a different host from the Docker host (e.g., running Docker on a server, connecting from a workstation/laptop), then you will need to substitute the IP address of the Docker host for localhost in the above URL. In this latter case, you will also need to ensure that any relevant firewalls allow connections on TCP port 8787 from the "workstation" to the "server". A full discussion of the principles of firewall configuration is beyond the scope of this document. For further assistance, you may need to consult your local information technology team.

Remote server with only port 22 access

Some institutions may require that network access to the Docker host be restricted to port 22 (SSH). In this case, clients can still connect to the R Studio Server web UI by utilizing SSH port forwarding. In this scenario, where the container is running on a remote "server", you first need to initiate an SSH tunnel with the following command on the client workstation:

ssh -L 8787:DOCKER_HOST_ADDRESS:8787 USERNAME@DOCKER_HOST_ADDRESS

You should substitute the network address of the Docker host where the container is running in place of DOCKER_HOST_ADDRESS.
This command invokes SSH to create an encrypted tunnel between TCP port 8787 on the local host and TCP port 8787 on the host at DOCKER_HOST_ADDRESS. Firewalls do not need to be configured to allow TCP port 8787 to connect from the client to the server; rather, this tunnel is created over the standard SSH port 22. Therefore, you will need to ensure that any relevant firewalls allow connections on TCP port 22 from the "workstation" to the "server". A full discussion of the principles of firewall configuration is beyond the scope of this document. For further assistance, you may need to consult your local information technology team.

For more background on how SSH tunneling works, please see: https://www.ssh.com/ssh/tunneling/example.

Note that USERNAME in the above ssh command (and the respective password that you will be prompted to enter) is the local system credential on the host that resides at DOCKER_HOST_ADDRESS, not the ephemeral container credential.

If successful, the client should be able to visit http://localhost:8787 to see RStudio Server.

Similarly, if you are restricted to only TCP port 22 access to the Docker host, but wish to run R from the command line instead of running R Studio, you can first SSH to the Docker host (on port 22) and then follow the instructions in Connecting via SSH to connect to the container.

3. Connecting Interactively

A final option is to run the container in a way that directly presents the user with an interactive R session. The container will stop after you quit this R session. Note that this command starts a new container named 4ce, so you cannot already have a container with that name running when you issue it. The --rm flag will ensure that when the R session is quit, the container is stopped and cleaned up.

docker run --name 4ce -v /SOME_LOCAL_PATH:/4ceData --rm -it dbmi/4ce-analysis:version-2.4.0 R

4. Offline Usage

Some sites will have security controls in place that prevent a host that has access to the patient-level data required to run the analyses from connecting to external networks. In these circumstances, you will need to employ a second "bastion" host that will serve as the transfer mechanism to move the container image onto the isolated host, and then to transfer the summary result files generated by the analyses to the respective GitHub repositories.

Under this arrangement, the container image will be pulled from the Docker Hub registry onto the bastion host (which itself needs to have the Docker runtime installed). The container can then be run on the bastion host if any additional configuration requires internet access (e.g., installation of additional packages), and saved as a new Docker image. The image (whether the original one from the registry or an updated one) will then be saved to a .tar file, which can be transferred (e.g. via scp) to the isolated host. The image is then run on the isolated host as usual, with access to the required input data. The analysis packages are designed by default to save their results to a scratch file system location that is local to the container. Thus, the analyses can be run on the isolated host, the modified container (including the result files) can be saved as a new image, the image transferred back to the bastion host, and the result files uploaded.

In more detail, the steps are as follows:

1. On the bastion host: Pull the container image from the registry

docker image rm dbmi/4ce-analysis:version-2.4.0
docker pull dbmi/4ce-analysis:version-2.4.0

2. On the bastion host: Run the container, perform any desired customization, and save to a new image

See Starting Container above for information on running the container. See Connecting above for information on connecting to a running container. Once your modifications (package installations, updates, etc.) are complete, leave the container running and, in a separate shell, first get the container id of the running container:

docker ps

Then create a new image:

docker commit <CONTAINER_ID> 4ce_offline:updated

You should now see this image listed if you run

docker images | grep 4ce_offline

You can now stop the container.

3. On the bastion host: Transfer the image to the isolated host as a .tar file

Save the image as a .tar file:

docker save 4ce_offline:updated > ./4ce_offline_updated.tar

Now transfer 4ce_offline_updated.tar to the isolated host using, e.g., scp or ftp.
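Image .tar files can be large, so it may be worth confirming that the file arrived intact before loading it. The following is an optional sketch that is not part of the original procedure: write_checksum and verify_checksum are hypothetical helper names, and the code assumes the GNU coreutils sha256sum command is available on both hosts (on macOS, shasum -a 256 is the usual substitute).

```shell
# Hypothetical helper: write FILE.sha256 next to FILE before the transfer.
# The checksum is recorded against the base name so that it can be
# verified from whatever directory the files land in on the other host.
write_checksum() {
    ( cd "$(dirname "$1")" && \
      sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Hypothetical helper: verify FILE against FILE.sha256 after the transfer.
# Returns 0 when the checksum matches and non-zero otherwise.
verify_checksum() {
    ( cd "$(dirname "$1")" && \
      sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

Under these assumptions, you would run write_checksum ./4ce_offline_updated.tar on the sending host, transfer both the .tar and the .sha256 file, and run verify_checksum ./4ce_offline_updated.tar on the receiving host before docker load.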

4. On the isolated host: Load the .tar file as an image in Docker

docker load < 4ce_offline_updated.tar

5. On the isolated host: Run the container

Now you can run the container as indicated above. You will need to replace the name of the image with the name you used above.

docker run --rm --name 4ce -d -v /SOME_LOCAL_PATH:/4ceData \
                            -p 8787:8787 \
                            -p 2200:22 \
                            -e CONTAINER_USER_USERNAME=REPLACE_ME_USERNAME \
                            -e CONTAINER_USER_PASSWORD=REPLACE_ME_PASSWORD \
                            4ce_offline:updated

6. On the isolated host: Execute the desired analyses, saving results to the container's local file system

See Connecting above for information on connecting to a running container. Each of the 4CE Phase 2 projects is implemented from the same template code, so each should only require running PackageName::runAnalysis() and PackageName::validateAnalysis(), where PackageName is the name of the analysis package you would like to run. By default, these packages save their output to the container's local file system.

7. On the isolated host: Save the running container (with result files) as a new image

Get the container id of the running container:

docker ps

Then create a new image:

docker commit <CONTAINER_ID> 4ce_offline:with_results

You can stop the container after the above command completes.

8. On the isolated host: Transfer that new image as a .tar file back to the bastion host

docker save 4ce_offline:with_results > ./4ce_offline_with_results.tar

Now transfer 4ce_offline_with_results.tar to the bastion host using, e.g., scp or ftp.

9. On the bastion host: Load the .tar file as an image in Docker

docker load < 4ce_offline_with_results.tar

10. On the bastion host: Run the container

Now you can run the container as indicated above. You will need to replace the name of the image with the name you used above.

docker run --rm --name 4ce -d -v /SOME_LOCAL_PATH:/4ceData \
                            -p 8787:8787 \
                            -p 2200:22 \
                            -e CONTAINER_USER_USERNAME=REPLACE_ME_USERNAME \
                            -e CONTAINER_USER_PASSWORD=REPLACE_ME_PASSWORD \
                            4ce_offline:with_results

11. On the bastion host: Upload the result files to GitHub

See Connecting above for information on connecting to a running container. You can now run PackageName::submitAnalysis() for each of the projects to upload their results to GitHub.

5. Other Information

Preserving State

In general, no state (file system, running processes, etc.) will be preserved when the Docker container is terminated and re-run. If you need to persist files, you should write them to the directory mounted to the container using the -v argument in the docker invocation. This option will share a directory from the host environment, making it available in the running container. Anything in that directory will therefore be preserved when the container is stopped. For sites that need to run the container on a host that is isolated from the internet, there may be a need to persist the intermediate analysis results while the container is moved to a network location where it can push the files to GitHub. See Offline Usage for details. Documentation in the Phase2.1UtilitiesRPackage contains information on default container-local file system locations that are recommended for use as intermediate scratch space for use by analytic packages.

Stopping the container

When you need to stop the container (e.g., when you are done running QC and analyses) you can issue the following docker command:

docker kill 4ce
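By default, docker kill sends SIGKILL immediately, whereas docker stop sends SIGTERM first and only falls back to SIGKILL after a grace period. If you prefer the gentler shutdown, a wrapper along the following lines could be used. This is a sketch only: stop_container is a hypothetical helper name, and it assumes the container was started with --name 4ce as shown in this document.

```shell
# Hypothetical helper: stop the named container only if it is running.
stop_container() {
    name="$1"
    # List the names of running containers and look for an exact match.
    if docker ps --format '{{.Names}}' | grep -Fxq "$name"; then
        docker stop "$name"
    else
        echo "container '$name' is not running"
    fi
}
```

For example, stop_container 4ce stops the container if it is running. Because the container is started with --rm, it is also removed once it stops, so anything not written to /4ceData or committed to a new image is lost.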

Restarting R Studio Server

R Studio Utility is installed in /usr/sbin

If you need to restart R Studio Server inside the container (e.g., because it becomes unresponsive), run:

/usr/sbin/rstudio-server restart