Security Approach and Roadmap

jbradbury edited this page Jul 24, 2018 · 4 revisions

Original contributors: Ola Spjuth, Pablo Moreno, Marco Capuccini, Anders Larsson, Jon Ander Novella, Kristian Peters, Matteo Carone

Background and challenges

Scientific research is rapidly moving towards cloud computing, especially in computationally intensive areas like physics and life sciences. Work traditionally carried out in High Performance Computer Systems (HPC), inside secure intranets, is now also increasingly migrating to decentralized data centers run by various cloud providers. As such, security concerns that were deemed less relevant due to running inside a private network are now brought into the light.

PhenoMeNal (http://phenomenal-h2020.eu/) is a European H2020 project that provides the means for users to instantiate Virtual Research Environments (VREs) on top of public and private cloud providers. These VREs are built from tools packaged as software containers, which are orchestrated by Kubernetes. For chaining individual tools into analysis pipelines, scientific workflow systems like Galaxy, Luigi, Jupyter, and Pachyderm are used. A recent description of the architecture, with example applications, is available in the preprint at https://www.biorxiv.org/content/early/2017/11/24/213603.

When users launch a PhenoMeNal VRE, they instantiate and configure a virtual infrastructure of e.g. compute nodes, storage nodes, networks, and DNS. The launch process also configures and starts up a set of services such as Kubernetes and Galaxy, and during analyses the system pulls software containers from different repositories and executes them in this environment. There are hence a number of security aspects that need to be considered when developing and running such an environment. This document describes the approach taken by PhenoMeNal, together with potential attacks, to form a basis for future work to strengthen security on all levels.

Items below marked with ROADMAP are not yet in place and are to be considered for implementation based on prioritization.

The PhenoMeNal approach to security

General approach

To a large extent PhenoMeNal relies on widely used, community-scrutinized open source software frameworks. These are continuously patched, and high importance is placed on using the latest releases in PhenoMeNal. To this end, the continuous integration system is a key component.

Network communication

During deployment of a PhenoMeNal VRE, KubeNow [https://github.com/kubenow/KubeNow] initializes the virtual infrastructure and configures access to the VRE via Cloudflare [http://cloudflare.com/] for dynamic DNS services. This means that communication with services inside the VRE (e.g. Galaxy) is encrypted on its way to Cloudflare. If the user decides not to use Cloudflare in the deployment, there is no encryption. The launched VRE can only be reached on the standard ssh/http/https ports and on port 44 for the Galaxy sftp-downloader. All other ports are denied by the firewall; we use cloud-specific firewalls for all supported cloud providers (AWS, GCP, Azure, OpenStack).

  • All network traffic between the user and Cloudflare is encrypted with Cloudflare-managed https.
  • Network traffic between Cloudflare and the VRE is currently unencrypted. The plan is to set up encryption in the KubeNow deployment script. ROADMAP
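The default-deny firewall policy described above can be illustrated with a minimal sketch. This is not the KubeNow firewall configuration; the names `ALLOWED_TCP_PORTS`, `is_allowed` and `filter_rules` are hypothetical, and the port numbers for ssh/http/https are assumed to be the standard 22/80/443.

```python
# Hypothetical allowlist mirroring the ports named in the text.
# 22/80/443 are the standard ssh/http/https ports; 44 is the port the
# text gives for the Galaxy sftp-downloader.
ALLOWED_TCP_PORTS = {
    22: "ssh",
    80: "http",
    443: "https",
    44: "Galaxy sftp-downloader",
}


def is_allowed(port: int) -> bool:
    """Default-deny: a port is reachable only if it is explicitly listed."""
    return port in ALLOWED_TCP_PORTS


def filter_rules(requested_ports):
    """Split requested ports into (allowed, denied) lists, preserving order."""
    allowed = [p for p in requested_ports if is_allowed(p)]
    denied = [p for p in requested_ports if not is_allowed(p)]
    return allowed, denied
```

In an actual deployment the same allowlist would be expressed as security-group or firewall rules of the respective cloud provider rather than in application code.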

Development traffic includes communication to/from PhenoMeNal docker registry and the continuous integration systems Jenkins (https://phenomenal-h2020.eu/jenkins/) and Travis (https://travis-ci.org/kubenow/KubeNow).

  • Communication to/from the PhenoMeNal docker registry (https://docker-registry.phenomenal-h2020.eu/) is encrypted. The one exception is communication between Jenkins and our docker registry, which is based on trusted hosts (not encrypted) as it happens inside the OpenStack tenancy where it operates.
  • The Kubernetes cluster running Jenkins and the docker registry, as well as portal and portaldev, runs on CoreOS, which is a self-updating, cluster-aware system, with most portions being read-only, made to run everything as a container. It reboots nodes sequentially to avoid downtime.
  • Communication to/from Jenkins is encrypted with https and also protected by two-factor authentication.
  • Communication to/from KubeNow and Travis CI is fully encrypted (https).

Deployment

PhenoMeNal VREs are designed to be launched on-demand and terminated after completed analysis. The deployment uses a base image to speed up provisioning.

  • The latest incremental security patches are applied to the image on startup. Images are rebuilt on a daily basis and tested for deployment, to ensure that security patches do not introduce any abnormality in the deployment process.
  • All virtual machines accept only SSH keys; passwords are not allowed.
  • For long-running services in the VRE (e.g. Galaxy, Jupyter) the startup script checks for and rejects weak application passwords on launch.
  • A cron job that installs upgrades is executed every 24 hours on running VREs. This does not, however, cover updates that require a reboot.
  • The storage nodes are accessible via SSH only. Storage is not mounted on virtual machines, only on containers used in analysis.
  • Cloud provider credentials are not stored in the cluster, only on the deployer host.
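As an illustration of the weak-password check mentioned in the list above, here is a minimal sketch of what such a startup check could look like. The actual criteria used by the PhenoMeNal startup script are not stated in this document; the policy below (`MIN_LENGTH`, three of four character classes) and the function names are assumptions chosen purely for illustration.

```python
import string

# Hypothetical policy; the source does not state the actual criteria.
MIN_LENGTH = 12
MIN_CHARACTER_CLASSES = 3


def is_weak(password: str) -> bool:
    """Return True if the password fails the illustrative policy."""
    if len(password) < MIN_LENGTH:
        return True
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    return sum(classes) < MIN_CHARACTER_CLASSES


def check_or_exit(password: str) -> None:
    """Abort the launch if the application password is weak."""
    if is_weak(password):
        raise SystemExit("Refusing to launch: application password is too weak")
```

Failing at launch, rather than warning, means a VRE with a guessable password never becomes reachable.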

Services inside VRE

Kubernetes is used for container orchestration. PhenoMeNal relies on kubeadm for the setup of Kubernetes, which in itself is not reachable at runtime by default. The only way to access it is by having access to the private key stored on the computer where it was launched, or via the Kubernetes Dashboard (if that is deployed, which is not the default).

  • Galaxy
    • How do we secure Galaxy? Galaxy has essentially a user-password mechanism only. It could be attached to more complex user management / security schemes.
  • Jupyter is not run in privileged mode, so it is confined to its container. However, Jupyter exposes a terminal that runs as root in the container.
  • Luigi provides a web interface which only allows for inspecting what is running. It is secured with basic auth behind https and a non-trivial password.
  • Pachyderm can only be reached via SSH into the master node.
  • We currently offer the option to deploy the Kubernetes dashboard, which exposes low-level access to the cluster that could easily be used to compromise it or the running services. It is, however, secured with basic auth behind https and a non-trivial password.
  • As mentioned above, we only allow strong master passwords for all long-running services.
  • In general, we should avoid running processes inside containers as root, as user rights could be circumvented in the infrastructure (e.g. access to NFS volumes, masquerading as an LDAP user).

Container sources and build process

Analysis tools in PhenoMeNal are available as software containers. The actual tools are developed as open source, and hence the source can be inspected. This promotes community efforts to discover and resolve bugs and security issues.

  • Code inspection of containers is too resource-demanding and is not a target for PhenoMeNal. There are utilities though to generate reports of built containers, which we could implement at the CI level, when building them.
  • Containers should be signed and the VRE should only allow signed containers to be installed. ROADMAP
  • Security could be improved if all accepted containers came from the same, secure, hardened base container that is continuously patched. This would be a rather big undertaking. ROADMAP
  • The PhenoMeNal Docker registry is publicly available, but read-only. The reverse proxy that exposes the registry points to a read-only instance of the container running it. A second registry container, configured as read-write (push-pull), is not mapped to any port in the reverse proxy, so it is not accessible from the outside. Currently, the CI slave machines that push to this read-write instance are inside the same trusted network as the instance, and they push to it without packets leaving that network. If the internal OpenStack tenancy trusted network were compromised, it would of course be easy for someone to push to the registry. Making the repository private would complicate matters for users, as we encourage them to use the containers both within and outside of the VRE without having to register and provide passwords on each interaction.
    • Security could be further improved by only allowing for CI system/user to push. ROADMAP
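The roadmap item on signed containers could be approximated, in its simplest form, by pinning content digests. The sketch below is not signature verification and not an existing PhenoMeNal mechanism: it only shows the weaker, related idea of accepting an image blob when its SHA-256 digest matches a pinned value. The function names and the pinned set are hypothetical.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return a digest string in the common 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def is_trusted(image_bytes: bytes, pinned_digests: set) -> bool:
    """Accept the blob only if its digest matches one of the pinned values."""
    actual = sha256_digest(image_bytes)
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(actual, pinned) for pinned in pinned_digests)
```

Digest pinning detects tampering with a known image but, unlike signing, says nothing about who produced it, which is why it is only a stepping stone towards the roadmap item.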

Potential attacks

Because the network to the VRE is encrypted and it is only reachable by http/https/ssh ports, we consider the largest potential risks to come from the long-running services offering a web UI/API. These are protected by passwords; however, if an attacker manages to acquire the master password or to gain access via code injection, the attacker could potentially exploit security holes in the services (e.g. Galaxy, Jupyter) and possibly take over the cluster.

  • Code injection
    • Primarily via Web UI/API
    • Sneaking a back door into the container development process
  • Biggest risks identified
    • Acquire Jupyter password - the attacker gets a terminal that is root in the container and can try different exploits from there.
    • Acquire Galaxy password - the attacker can probe for security holes in the integrated tools, although they are highly constrained by the Galaxy UI. Galaxy tools tend to have drop-down menus more often than text fields, and text fields tend to be limited to defined data types (integer, float, string). Of these, only string fields would be potentially dangerous, but they tend to be limited in length (they are usually single-line input fields rather than multi-line text areas, i.e. the HTML textarea element). A user who gains the administrative password of Galaxy could eventually add new tools, which would be a higher risk.
      • Possibly a way to mitigate this would be to use third party authentication systems (Galaxy supports LDAP, OpenID and other providers). ROADMAP
    • Acquire Kubernetes dashboard password - can deploy new pods, destroy existing pods (both application and infrastructure related) and potentially take over the cluster. Kubernetes dashboard is not installed by default.
  • Steal Kubernetes token or private key from the person deploying.
  • Memory exploits (e.g. Meltdown and Spectre V1 + V2) - read out memory contents of connected containers - can be mitigated by applying the latest kernel patches and microcode updates for Intel processors.
    • Still need to get into the system.
  • Intercept and decode unencrypted data traffic [Google had this problem a few years ago when the NSA intercepted their traffic, because all traffic within the Google data centers was unencrypted. For good scientific practice we have to make sure that communication between Kubernetes and other microservices is encrypted as well.]
    • Second layer of defense, not target for now for PhenoMeNal.
  • Denial-of-service by overloading the infrastructure (e.g. command injection through APIs)
    • Cloudflare has some protection in place for this
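The point made above about Galaxy tool inputs being mostly typed, length-limited fields can be illustrated with a small validation sketch. This is not Galaxy code or the Galaxy API; `validate_param`, `MAX_TEXT_LENGTH` and the list of forbidden characters are hypothetical and only show why typed fields leave little room for injection while free-text fields need extra care.

```python
# Hypothetical limits, chosen only for illustration.
MAX_TEXT_LENGTH = 255
FORBIDDEN_CHARS = ";|&`$<>\n"


def validate_param(kind: str, raw: str):
    """Convert a raw form value according to its declared type, or raise ValueError."""
    if kind == "integer":
        return int(raw)    # int() raises ValueError on anything but an integer
    if kind == "float":
        return float(raw)  # float() raises ValueError on non-numeric input
    if kind == "text":
        if len(raw) > MAX_TEXT_LENGTH:
            raise ValueError("text value too long")
        if any(ch in raw for ch in FORBIDDEN_CHARS):
            raise ValueError("text value contains a forbidden character")
        return raw
    raise ValueError(f"unknown parameter type: {kind}")
```

Numeric fields reject injected payloads as a side effect of type conversion, so only the text branch needs explicit checks.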

ELSI/GDPR approach

PhenoMeNal provides an e-infrastructure which can be used by researchers worldwide to process and analyse metabolomics and related data. As such, there are ethical, legal and social implications (ELSI) of its use. Importantly, the EU has adopted new legislation, the General Data Protection Regulation (GDPR), which came into legal force on 25 May 2018. The new regulation replaces previous EU data protection legislation, and has additional restrictions and much larger penalties for infringement. PhenoMeNal has been designed to accommodate cases where sensitive data needs to be processed. In particular, the key design principles flowing from ELSI considerations are:

  • All users of PhenoMeNal VRE deployed on a public cloud provider must ONLY upload and process FULLY ANONYMISED data. That is, data which could never be used to identify an individual. This is stipulated in the terms of use which users must accept in order to use the system.
  • To process sensitive/identifiable data (e.g. clinical studies) PhenoMeNal must be installed locally behind an institute/hospital firewall with appropriate ethics approval. This allows the use of all the usual PhenoMeNal tools without the risk of data leaking into the outside world.
  • PhenoMeNal users must go through a registration process in which user details are collected and in which they are made aware of a number of ELSI tools to facilitate proper use of the environment:
    • The PhenoMeNal terms and conditions of use, which they are required to read
    • Flowcharts to take the user through an ELSI requirements process for data analysis
    • Guidance on the use of sensitive data
    • A Data provider form (if required) which can be used as a template when requesting use of data from another research group
    • Links to sites offering additional guidance on ELSI considerations
      • Global Alliance for Genomics and Health.
      • BioMedBridges Ethical Governance Framework
      • Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).

Security Roadmap

This list contains aspects of PhenoMeNal that are relevant from a security perspective. The list is not in priority order; it is intended for public discussion and for prioritization in the development process.

  • Set up encryption of the network between Cloudflare and the VRE in the KubeNow deployment script.
    • This should be doable without too much work.
  • Containers should be signed and the VRE should only allow signed containers to be installed.
    • It would probably be more for reproducibility purposes than security.
  • Security could be improved if all accepted containers came from the same, secure, hardened base container that is continuously patched.
    • This would be a rather big undertaking and could be disruptive for developers.
  • Only allowing for CI system/user to push to docker registry
  • To strengthen security in Galaxy, use third-party authentication systems (Galaxy supports LDAP, OpenID and other providers). Consider also using x509 client certificates.
  • Consider using VPN. This could however impact simplicity of use.
    • Explore commercial cloud-agnostic one-click VPN services.
    • Explore setting up a VPN using the cloud provider.
    • Explore option to use VPN instead of Cloudflare.
  • Consider Singularity when support in Kubernetes is added.
    • Not available yet.
  • Implement offline VREs, no connection to Internet.
    • Substantial undertaking but should be possible.
  • Consider whitelisting containers for workflows.
    • We anticipate that this will be quite resource-demanding to keep in sync. Probably not the road we want to go down.
  • Consider deleting/nullifying data, since cloud providers do not cryptographically delete data after volumes are destroyed.
    • For very sensitive data, this could be an aspect. Should not be too difficult.
  • Set up CIDR rules at the cloud provider that only admit traffic proxied through Cloudflare.
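The last roadmap item, restricting inbound traffic to the proxy's address ranges, can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are placeholders taken from the RFC 5737 documentation ranges, not real Cloudflare ranges; in practice the list would have to come from the ranges Cloudflare publishes. The names `PROXY_CIDRS`, `build_networks` and `is_from_proxy` are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real Cloudflare ranges.
PROXY_CIDRS = ["192.0.2.0/24", "198.51.100.0/24"]


def build_networks(cidrs):
    """Parse CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(c) for c in cidrs]


def is_from_proxy(source_ip: str, networks) -> bool:
    """Return True if the source address falls inside any allowed range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in networks)
```

The real enforcement point would be the cloud provider's firewall or security group, which applies the same membership test at the network edge so that direct connections bypassing the proxy are dropped.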

Security Review

A security discussion/review was conducted between the PhenoMeNal H2020 project (http://phenomenal-h2020.eu/) and NeIC-Tryggve2 (https://neic.no/tryggve2/) experts as well as other invited experts. The objective was to discuss the security approach taken by PhenoMeNal, to evaluate current status, identify the biggest risks, and to offer advice on future activities to strengthen the security in the project components. A summary of the outcome is available: TBC
