
Research Object Proposal

Stian Soiland-Reyes edited this page Jul 23, 2018 · 74 revisions

This page is outdated, see instead the CWLProv profile.

This is a proposal document for Research Objects to capture provenance of CWL workflow runs, as well as associated metadata.

Draft notes by: @FarahZKhan, @stain

Random link collection (TODO Describe and organize)

  • Previous CWL discussions
    • https://github.com/common-workflow-language/common-workflow-language/issues/84

      @ntijanic proposed that we specify properties extending wfprov to include information about the command line and environment setup used for the workflow run. He further suggested we could use EDAM, an ontology of bioinformatics operations, data types, identifiers, application domains and formats.

      @stain suggested a Research Object bundle including all the files generated by a workflow run, plus copies of the CWL workflow and tool descriptions at the time of running. Specifications, use and principles can also be discussed in their gitter channel. A few points from his comment:

      1. cwl:CommandLineTool can be represented as wfdesc:ProcessImplementation
      2. wfdesc:Workflow can be similar to wfdesc:WorkflowDefinition, and the Workflow as a structure can have a common UUID identifier. In addition, when a workflow is used as a nested workflow more than once, we can make an identifier for each use of the sub-workflow to match the outer cwl:Step, where both identifiers refer to the same wfdesc:hasWorkflowDefinition. We might face problems tracking provenance of the sub-workflow runs, as two runs will be associated with the same parameter names.
      3. If a cwl:Step is a sub-workflow, then the equivalent wfprov:ProcessRun gets 'upgraded' to a wfprov:WorkflowRun, which would have inner ProcessRuns linked to it with wfprov:wasPartOfWorkflowRun.

We can also benefit from existing resources, such as the many experiments represented as Research Object bundles in ROHUB (Research Object Digital Library), to better capture the required details in the bundle and preserve it for future use. A significant characteristic of ROHUB is that it keeps track of the history of an RO, identifying the points of difference between two versions of a study. Its quality checks are also worth exploring, as they are similar to what we discussed (guiding the user to capture all the necessary domain-specific details by running quality checks on the RO). A key example case study using an RO to aggregate data and metadata that enrich workflow specifications is a workflow-based experiment investigating aspects of Huntington's disease.

This document details the provenance bundle provided as a downloadable object for any Taverna workflow. It contains all the components required to fully understand a workflow run. wfdesc is used to represent the workflow in abstract form, from which an abstract visual representation can be generated for better readability. In addition, the bundle contains a provenance file for the particular workflow run for which the bundle was created; this file uses provenance vocabularies such as PROV-O, wfprov and tavernaprov to capture details of inputs, outputs, intermediate results and parameters. Four further subdirectories in the bundle, named inputs, outputs, intermediates and .ro, contain the inputs to the workflow run, the outputs produced by the workflow execution, the intermediate results from the processes of the workflow, and the manifest and abstract workflow description mentioned above. This document can be built upon by adding other provenance-related factors, such as compute and storage requirements (which can be extracted from the CWL description and annotated in the provenance file), to give users more information before they enact the workflow.

Biocompute Objects aim to store and transfer interoperable information about NGS computational analyses (mainly carried out by chaining different tools together in a workflow), such as the versions and parameter settings of the tools in a pipeline, and the availability or reference of the input and output data for authenticity and verification of the results and the pipeline, along with other important metadata, resulting in improved standards for evaluation, validation and verification of bioinformatics analyses. Currently everything is saved in a single JSON file. We can think of ways to connect to the other files for keeping track of the provenance information. A CWL executable workflow with a BCO packaged in an RO will hopefully cover enough information and will be interoperable and portable, as suggested by Stian in this workshop.

Current work produces an RO bundle when provided with a link to the workflow specification (an example can be visualized here). The viewer provides a visual representation of the workflow specification, including nodes for sub-workflows. The resulting bundle contains the workflow specification and a manifest file with metadata and prospective provenance aspects, and can be shared as a zipped folder for later use. This can be extended to track retrospective provenance for the workflow run. In the process, the intermediate files can be tracked along with their checksums and metadata using BagIt, a file system structure for transferring and archiving files.
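Since BagIt is proposed for transferring and archiving the tracked files, a minimal sketch of a bag may help: per the BagIt specification, payload files live under data/, checksums go in a fixity manifest, and bagit.txt declares the bag. The file name and contents below are invented for illustration.

```shell
# Build a minimal BagIt bag: declaration, payload, and sha256 fixity manifest
mkdir -p bag/data
printf 'example payload\n' > bag/data/result.txt
printf 'BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n' > bag/bagit.txt
( cd bag && sha256sum data/result.txt > manifest-sha256.txt )
# The manifest can later be re-verified to detect corruption in transfer
( cd bag && sha256sum -c manifest-sha256.txt )
```

Re-running `sha256sum -c` on the receiving side is what gives BagIt its archival value: any modified payload file fails the check.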

Notes from Code fest:

For development purposes, you can install cwltool in a virtual environment:

  • Install virtualenv via pip: pip install virtualenv
  • Clone the cwltool: git clone https://github.com/common-workflow-language/cwltool.git
  • Switch to cwltool directory: cd cwltool
  • Create a virtual environment: virtualenv cwltool
  • To begin using the virtual environment, activate it: source cwltool/bin/activate
  • To check that the virtual environment is set up: run type python and confirm it resolves inside the environment
  • Install cwltool in the virtual environment: pip install .
  • Check the version which might be different from the version installed in general on any system: cwltool --version
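The steps above can be consolidated into a single sketch. The clone and install steps are network-dependent and left as comments, and the environment is created under /tmp for illustration; python3 -m venv is used here as a stdlib equivalent of virtualenv.

```shell
# Sketch of the setup steps above; network-dependent steps are commented out
# git clone https://github.com/common-workflow-language/cwltool.git
# cd cwltool
python3 -m venv /tmp/cwltool-env   # stdlib equivalent of: virtualenv cwltool
. /tmp/cwltool-env/bin/activate    # activate the environment
type python                        # should now resolve inside /tmp/cwltool-env
# pip install .                    # install cwltool into the environment
# cwltool --version                # confirm which version the environment provides
```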

Structure of Research Object (RO):

Following is the proposed structure of a Research Object:

We can have a basic directory structure as

  • data: The data directory should have three subdirectories: inputs, outputs and intermediates.
  • workflow: The workflow directory should contain the input object (a JSON file) with relativised paths, the normalised and centralised executable workflow file, and a subdirectory named main containing the workflow specification and tool specifications with relativised paths, so the workflow can be re-run from inside the RO.
  • snapshot: This directory contains copies of the original workflow and tool specifications as-is (so it might contain absolute paths or be host-specific).
  • metadata: The metadata directory mainly contains provenance about the workflow run, its data products, and the manifest for this Research Object. The provenance subdirectory should contain at least two files, job-environment.jsonld and workflow-execution.jsonld. job-environment.jsonld should capture the actual command-line arguments passed to cwltool and might contain absolute paths to the files (also copied into the snapshot). workflow-execution.jsonld should contain the information about the workflow execution associated with this RO. The manifest.json file must list all the resources aggregated by this RO, which may include external resources such as docker images.
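A sketch of the resulting layout, built only from the directory and file names listed above (placeholder files are empty; they only illustrate the structure):

```shell
# Skeleton of the proposed Research Object layout
mkdir -p ro/data/inputs ro/data/outputs ro/data/intermediates
mkdir -p ro/workflow/main
mkdir -p ro/snapshot
mkdir -p ro/metadata/provenance
touch ro/metadata/provenance/job-environment.jsonld
touch ro/metadata/provenance/workflow-execution.jsonld
touch ro/metadata/manifest.json
find ro | sort   # print the tree for inspection
```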

Level 1 provenance

Potential attributes/elements of the workflow-execution.jsonld provenance document:

We expect that each workflow run involves at least two files: the CWL workflow description file and the JSON input object.

  • wfprov:WorkflowRun requires a runID for every workflow execution. This could be achieved by generating a UUID.
  • wfprov:describedByWorkflow can point to the relativised path for workflow specification aggregated in the RO.
  • wfprov:ProcessRun requires an ID, which can be associated with the nested sub-workflow if the process/step consists of a sub-workflow. Every ProcessRun/ProcessExecution will use a set of inputs of possibly various data types. Input files, and default arguments requiring a file object, should be included in the RO for that run. For all other input arguments (string, float, int, etc.) the ID can be mapped from the input object JSON file to the prov document.
  • wfprov:wasPartOfWorkflowRun is easy to capture if we decide to have a UUID for the workflow run.
  • wfprov:describedByProcess can point either to an underlying CWL tool description or simply to a step without any underlying tool description; in the latter case the value should be declared null or equivalent.
  • wfprov:Artifact: each artifact in the workflow should be given an ID so that it can be linked to more than one process if required.
  • wfprov:describedByParameter can link to the workflow input object and the workflow description for the exact value and type respectively.
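As a sketch, a minimal workflow-execution.jsonld using the wfprov terms above might look like the following; the UUID and the workflow path are invented for illustration, and the actual document would carry many more statements.

```shell
# Write a hypothetical minimal workflow-execution.jsonld and check it parses as JSON
cat > workflow-execution.jsonld <<'EOF'
{
  "@context": { "wfprov": "http://purl.org/wf4ever/wfprov#" },
  "@id": "urn:uuid:00000000-0000-0000-0000-000000000000",
  "@type": "wfprov:WorkflowRun",
  "wfprov:describedByWorkflow": { "@id": "../workflow/packed.cwl" }
}
EOF
python3 -m json.tool workflow-execution.jsonld > /dev/null && echo JSON-OK
```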

PROV-N

Attempt to model the above as PROV-N with a couple of wfprov terms: Example workflow run provenance
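To give a flavour of the linked example, a minimal PROV-N sketch of one run using the wfprov terms above might look as follows (all identifiers and timestamps are invented):

```
document
  prefix wfprov <http://purl.org/wf4ever/wfprov#>
  prefix ex <http://example.org/>

  activity(ex:run1, 2018-07-23T10:00:00, 2018-07-23T10:05:00, [prov:type='wfprov:WorkflowRun'])
  entity(ex:out1, [prov:type='wfprov:Artifact'])
  wasGeneratedBy(ex:out1, ex:run1, 2018-07-23T10:05:00)
endDocument
```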

Issues:

  • Unable to run packed.cwl when executing ../../cwltool.py --debug --provenance fred packed.cwl master-job.json, where packed.cwl is the normalized workflow specification file and master-job.json is the newly created relativised job file; it generated the following error:
file:/cwltool/fred/workflow/packed.cwl#revtool.cwl

This is probably because the provenance component is trying to pack an already packed CWL workflow specification.

Resolved: When a packed cwl file is executed, the command is as follows:

../../cwltool.py --debug --provenance fred packed.cwl#main master-job.json

Existing RO:

The existing RO contains the relativised inputs, outputs, the packed CWL workflow file, and the master job file containing relative paths to the files in the RO. There is a limitation yet to be resolved for outputs: currently we only collect and store file-type outputs, whereas a workflow might produce output in a different format. That will also be captured as we go ahead, possibly by writing that particular object to a file.

SHA1 or SHA256 for the PROV document: currently we are using a SHA1 hash to relativise the paths of the input and output files with respect to an RO, but there has been an interesting discussion here about hash collisions, so we may consider SHA256 or stronger in future. The existing cwltool implementation provides SHA1 checksums for file objects, which would need to be converted to SHA256 in order to be consistent across all the files in the RO, including data files and job files.
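A quick illustration of the two digests over the same file content; the sample file is invented, and these are the standard sha1sum/sha256sum outputs for it:

```shell
# Compare SHA1 and SHA256 digests of the same content
printf 'hello\n' > sample.txt
sha1sum sample.txt     # f572d396fae9206628714fb2ce00f72e94f2258f  sample.txt
sha256sum sample.txt   # 5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03  sample.txt
```

Mixing the two in one RO is what the conversion above avoids: a consumer resolving a relativised path must know which digest produced it.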

Getting started with PROV documents: we can start with the prov Python package as the format to write this document. The sample PROV document given above uses SHA256 identifiers, so as mentioned earlier we can relate that to the data in the RO (relativised using a SHA1 hash).

Here is a sample PROV document which is using different specifications such as PROV-N, wfdesc and wfprov.

Notes from CoFest 2018

Below is the list of metadata which CWL-metrics currently uses. @inutano wants CWL-prov to have them as well so that CWL-metrics can extract the metadata from cwltool --provenance output, and put the metrics data back in the Research Object.

  • Workflow
    • StartDate
    • EndDate
    • CWL filename
  • Steps
    • Step name
    • CWL filename
    • Container ID
    • Tool status (success or permanentFail)
    • List of File-type inputs and their sizes (including intermediate files)
    • List of File-type outputs and their sizes

The current implementation of CWL-metrics parses cwltool --debug stdout to get this information, which forces users to pass many options when running cwltool.
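As a sketch, the fields above might serialise to something like the following JSON; all field names and values here are invented, and CWL-metrics' actual schema may differ.

```shell
# Hypothetical serialisation of the metadata fields listed above
cat > metrics-metadata.json <<'EOF'
{
  "workflow": {
    "start_date": "2018-07-23T10:00:00Z",
    "end_date": "2018-07-23T10:05:00Z",
    "cwl_file": "workflow.cwl"
  },
  "steps": [
    {
      "step_name": "step1",
      "cwl_file": "tool.cwl",
      "container_id": "0123456789ab",
      "tool_status": "success",
      "inputs": [ { "path": "data/inputs/in.txt", "size": 1024 } ],
      "outputs": [ { "path": "data/outputs/out.txt", "size": 2048 } ]
    }
  ]
}
EOF
python3 -m json.tool metrics-metadata.json > /dev/null && echo JSON-OK
```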