Use Case: Describe/include software containers #39

Open
stain opened this issue Aug 12, 2019 · 9 comments
Labels
use-case A (potential) use-case for ROLite creation, consumption or integration

Comments

@stain
Contributor

stain commented Aug 12, 2019

As an open science researcher, I want to provide Docker/Singularity container images so that others can reliably reproduce my results or reuse the same software.

This implies that the container images and their recipes (e.g. Dockerfile) should be included in the RO-Crate and typed as such, so users know they can be executed.

It is also desirable to use tooling to expand the description with a list of the dependencies installed in the container; this will help provide light-weight software citations.
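As a rough illustration of "included and typed as such", the sketch below builds two JSON-LD entities, one for the recipe and one for the image built from it. The type names ContainerRecipe and ContainerImage are taken from the examples later in this thread; the function name, the file paths, and the use of schema.org's isBasedOn to link the two are assumptions made for illustration, not an agreed RO-Crate convention.

```python
import json


def container_entities(recipe_path, image_path, image_name):
    """Return JSON-LD entities for a container recipe and its image.

    Hypothetical helper: the typing and the isBasedOn link are
    assumptions, not something this issue has settled.
    """
    recipe = {
        "@id": recipe_path,
        "@type": "ContainerRecipe",
        "name": f"Build recipe for {image_name}",
    }
    image = {
        "@id": image_path,
        "@type": "ContainerImage",
        "name": image_name,
        # Link the image back to the recipe it was built from.
        "isBasedOn": {"@id": recipe_path},
    }
    return [recipe, image]


if __name__ == "__main__":
    entities = container_entities("Dockerfile", "image.tar", "vsoch/salad")
    print(json.dumps(entities, indent=4))
```

A consumer could then look for entities of these types to decide whether the crate contains something executable.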

Related efforts to align with:

@stain stain added the use-case A (potential) use-case for ROLite creation, consumption or integration label Aug 12, 2019
@stain stain changed the title Use Case: ... Use Case: Describe software containers Aug 12, 2019
@stain stain changed the title Use Case: Describe software containers Use Case: Describe/include software containers Aug 12, 2019
@stain
Contributor Author

stain commented Aug 12, 2019

Example descriptions generated by extract-dockerfile

From a Dockerfile we describe a ContainerRecipe (which specializes SoftwareSourceCode):

{
    "@context": "http://www.schema.org",
    "@type": "ContainerRecipe",
    "name": "vsoch/salad",
    "description": "A Dockerfile build recipe",
    "containerImage": "gliderlabs/alpine:3.4",

    "labels": [
        [
            "MAINTAINER toasterlint \"henry@toasterlint.com"
        ]
    ],
    "environment": [
        "RPCPORT=4000"
    ],
    "entrypoint": [
        "/entrypoint"
    ]
}

(see openschemas/specifications#10)
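To make the mapping from Dockerfile to ContainerRecipe more concrete, here is a minimal sketch that reads a few Dockerfile instructions and fills in the same fields as the example above. This is not how extract-dockerfile is implemented; the function name is hypothetical, and the parser deliberately ignores line continuations, FROM flags such as --platform, and LABEL/MAINTAINER handling. With several FROM lines, the last one wins.

```python
import json


def describe_dockerfile(text, name):
    """Build a ContainerRecipe-shaped dict from Dockerfile text (simplified)."""
    recipe = {
        "@context": "http://www.schema.org",
        "@type": "ContainerRecipe",
        "name": name,
        "description": "A Dockerfile build recipe",
        "environment": [],
    }
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if not rest:
            continue
        if instruction == "FROM":
            # Keep only the image reference, dropping any "AS stage" suffix.
            recipe["containerImage"] = rest.split()[0]
        elif instruction == "ENV":
            recipe["environment"].append(rest)
        elif instruction == "ENTRYPOINT":
            # Exec form is a JSON array; shell form is kept as one string.
            try:
                parsed = json.loads(rest)
            except json.JSONDecodeError:
                parsed = None
            recipe["entrypoint"] = parsed if isinstance(parsed, list) else [rest]
    return recipe


if __name__ == "__main__":
    dockerfile = (
        "FROM gliderlabs/alpine:3.4\n"
        "ENV RPCPORT=4000\n"
        'ENTRYPOINT ["/entrypoint"]\n'
    )
    print(json.dumps(describe_dockerfile(dockerfile, "vsoch/salad"), indent=4))
```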

From a Docker image we describe a ContainerImage:

{
    "environment": [
        "SRC_DIR=/go/src/github.com/vsoch/salad/"
    ],
    "entrypoint": [
        "/code/salad"
    ],
    "description": "A Dockerfile build recipe",
    "name": "vanessa/sregistry",
    "ContainerImage": "iron/go:dev",
    "operatingSystem": "linux",
    "softwareVersion": "sha256:8d1e7f244db9e7cb85d5867bb3230f756460900e5801ff2303e44a79369640f4",
    "identifier": [
        "vanessa/sregistry:latest"
    ],
    "url": "https://hub.docker.com/r/vanessa/sregistry",
    "alternateName": "Singularity Registry",
    "softwareHelp": "https://singularityhub.github.io/sregistry",
    "citation": "http://joss.theoj.org/papers/050362b7e7691d2a5d0ebed8251bc01e",
    "license": "https://github.com/singularityhub/sregistry/blob/master/LICENSE",
    "keywords": "container, containers, singularity, singularity registry",
    "softwareRequirements": [
        "Pip > xmlsec==1.3.3"
    ],
    "@context": "http://www.schema.org",
    "@type": "ImageDefinition"
}

In the example above, extract-dockerfile has actually extracted the softwareRequirements from the pip installs inside the container.

(However, this type is called ContainerImage rather than ImageDefinition, so some stability with upstream specs would be needed; see openbases/extract-dockerfile#6.)
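For the dependency list, one possible approach is to turn `pip freeze`-style output captured from inside the container into softwareRequirements strings. The sketch below assumes the "Pip > name==version" format seen in the example above; how extract-dockerfile actually collects these values is not shown in this thread, and the function name is hypothetical.

```python
def pip_requirements(freeze_output):
    """Convert `pip freeze`-style text into softwareRequirements strings.

    Only pinned "name==version" lines are kept; comments, blank lines
    and editable/URL installs are skipped in this simplified sketch.
    """
    requirements = []
    for line in freeze_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        requirements.append(f"Pip > {line}")
    return requirements


if __name__ == "__main__":
    print(pip_requirements("xmlsec==1.3.3\n"))
```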

@vsoch

vsoch commented Aug 12, 2019

See the discussion in openbases/extract-dockerfile#6 about the name. My preference is for what is represented in https://openschemas.github.io/specifications/ because (as you correctly bring up) an ImageDefinition could refer to other kinds of images, whereas ContainerImage is clearer.

@dgarijo
Contributor

dgarijo commented Aug 12, 2019

This is interesting! Would this need to be related to CWL as well? (CWL defines how to invoke the image, as opposed to the definition of the image itself.)

In Dockerpedia they have done a thorough extraction of images, although it's not aligned with schema. Maybe we can use their service for extraction too. An example: https://dockerpedia.inf.utfsm.cl/resource/SoftwareImage/dockerpedia-pegasus_workflow_images_latest

@vsoch

vsoch commented Aug 12, 2019

I don't think it would be wise to "hard code" (so to speak) any particular workflow manager or description (e.g., cwl, snakemake, nextflow) directly into the specification. On the other hand, if there is an appropriate field to describe this same entity, it would be logical to include (e.g., if I find that it's snakemake, I should look for a Snakefile somewhere...)

For CWL, is there a definitive specification for interaction? For example, for a scif container, you can be absolutely sure how to discover applications inside (singularity run container.sif apps) and then how to run / inspect / shell / otherwise interact with an application you just found (e.g., singularity run container.sif run <app>).
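The predictability described here means a tool can construct the interaction commands without knowing anything else about the container. The sketch below only builds the two command lines quoted above as argument lists; it does not execute them, and the helper names are hypothetical.

```python
import shlex


def scif_list_apps(container):
    """Command that lists the SCIF apps inside a container."""
    return ["singularity", "run", container, "apps"]


def scif_run_app(container, app, *args):
    """Command that runs one SCIF app, with optional extra arguments."""
    return ["singularity", "run", container, "run", app, *args]


if __name__ == "__main__":
    # Print the commands instead of running them; Singularity may not be installed.
    print(shlex.join(scif_list_apps("container.sif")))
    print(shlex.join(scif_run_app("container.sif", "myapp")))
```

Passing the resulting list to `subprocess.run` would execute the command on a machine where Singularity and the container are available.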

@dgarijo
Contributor

dgarijo commented Aug 12, 2019

CWL has a field for pulling from a docker container. Maybe that could be the hook.
My point is not necessarily to use a particular workflow spec. What I want to record is how the app in the container is supposed to be invoked and how to pass on the files. Since CWL describes this, it could be a starting point.
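The CWL field in question is `dockerPull`, which lives in a `DockerRequirement` under `requirements` or `hints`. As a sketch of how it could serve as the hook, the function below collects the image references from an already-parsed CWL document (a plain dict, for example loaded from YAML by some other means). The function name and the sample tool description are illustrative assumptions.

```python
def docker_pull_images(cwl_doc):
    """Collect dockerPull values from a parsed CWL document.

    CWL allows requirements/hints either as a list of {"class": ...}
    entries or as a map keyed by class name; both are handled here.
    """
    images = []
    for section in ("requirements", "hints"):
        entries = cwl_doc.get(section) or []
        if isinstance(entries, dict):
            entries = [{"class": key, **(value or {})} for key, value in entries.items()]
        for entry in entries:
            if entry.get("class") == "DockerRequirement" and "dockerPull" in entry:
                images.append(entry["dockerPull"])
    return images


if __name__ == "__main__":
    # Hypothetical tool description, reusing an image name from this thread.
    tool = {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "/code/salad",
        "hints": {"DockerRequirement": {"dockerPull": "vanessa/sregistry:latest"}},
        "inputs": {},
        "outputs": {},
    }
    print(docker_pull_images(tool))
```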

@vsoch

vsoch commented Aug 12, 2019

Yes, understood! To be more clear, there are many different tools that describe in a structured way how a container (or app inside) is supposed to be invoked. Actually, those two things are different - cwl could describe an app in a container (and it would have to be provided via the entrypoint so the user could run it to find it) while SCIF describes how to invoke the container itself (of which cwl could be one or more entrypoints).

But from how you describe it - that there is a field for pulling the container, this sounds like it would need to be stored outside of the container, which is another point to discuss. SCIF is a specification that describes standard interaction with a container, and is installed inside the container, along with the SCIF filesystem and other metadata files that are defined for each app.

@craig-willis

This is a necessary use case for Whole Tale. A few questions:

  • What about RO-Crates with repo2docker compatible configurations?
  • In the case of a Docker image, is the idea that the RO-Crate would contain a tar archive of the image or a reference to the image in a registry (or either)?
  • While not containers in the same sense, sciunit and reprozip also produce re-executable packages that could be parts of RO-Crates. Are these in scope?

@vsoch

vsoch commented Oct 5, 2019

Having a repo2docker configuration is an interesting and useful idea, but I think it would be done in addition to a container recipe - repo2docker in and of itself doesn't translate to reproducibility - it just means that (assuming a version of repo2docker is available) you could build a container for it. You can think of it like an extra layer to essentially create a Dockerfile (that could be built). It also assumes a user "jovyan", which makes things a bit challenging when the container is converted to Singularity (e.g., for use on HPC) because of the cardinal rule "the user inside the container is the user outside the container."

Re-reading what @stain mentioned - it sounds like he wants the full container, in which case Docker wouldn't be as feasible as it means layers that need to be assembled and require the Docker daemon. A Singularity (sif) binary would be more reasonable, albeit large, and still require Singularity to run. It's really the case that any level of recipe without the container runs the risk of not being able to be built, so probably providing the container somewhere is needed. In the case of Singularity, the recipe file is kept inside the container as well. In the case of Docker, the recipe (and other metadata) would serve as an external way to peep inside without invoking the container.

I'm not super familiar with RO-Crates, but reading the description:

RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata.

it does sound like a wrapper (with metadata) to a container is wanted? The container, considered as some kind of data, could also fit into the specification, and as @stain showed, metadata could be extracted for the jsonld.

@jmfernandez
Contributor

Re-reading what @stain mentioned - it sounds like he wants the full container, in which case Docker wouldn't be as feasible as it means layers that need to be assembled and require the Docker daemon.

Indeed, with docker save you can generate a tar file containing the different layers of one or more tagged Docker images, which can later be used to generate a Singularity image with singularity import.
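If such a tar file were placed in a crate, its contents could be inspected without the Docker daemon. The sketch below assumes the archive contains a top-level `manifest.json` listing `RepoTags` and `Layers` for each image, which is the layout `docker save` is generally understood to produce; the function name is hypothetical and other archive layouts are not handled.

```python
import json
import tarfile


def saved_image_layers(tar_path):
    """Map each image tag in a `docker save` archive to its layer files.

    Assumes a top-level manifest.json with "RepoTags" and "Layers"
    entries; untagged images are skipped.
    """
    with tarfile.open(tar_path) as archive:
        manifest = json.load(archive.extractfile("manifest.json"))
    layers = {}
    for entry in manifest:
        for tag in entry.get("RepoTags") or []:
            layers[tag] = entry["Layers"]
    return layers
```

For example, `saved_image_layers("image.tar")` would return a dict keyed by tag, where `image.tar` is a placeholder for an archive created with docker save.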

I also agree that the container recipe is worth saving (or referencing, together with a fingerprint), as the base image of the recipe could contain a bug, and you would then want to re-create the image.
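A "reference plus a fingerprint" could look like the sketch below: the recipe is identified by its URL and accompanied by a SHA-256 digest of a local copy, so a later download can be checked against it. The helper names, the choice of SHA-256, and the use of the identifier property to carry the digest are assumptions for illustration only.

```python
import hashlib


def fingerprint(path, chunk_size=65536):
    """Return a "sha256:<hex>" digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def recipe_reference(url, local_copy):
    """Describe a remote recipe together with a digest of a local copy."""
    return {
        "@id": url,
        "@type": "ContainerRecipe",
        "identifier": fingerprint(local_copy),
    }
```

The same `fingerprint` function would work for an image archive, at the cost of reading a much larger file.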

5 participants