Skip to content

Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq

Repository files navigation

dockstore-workflow-rnaseq

A workflow for processing bulk RNA Sequencing datasets.

Description

This workflow automates and standardizes the processing of bulk RNASeq datasets. The following steps are performed:

CWL

Three main workflows are present in the root of this repository:

Subworkflows that the main workflows utilize are present in the subworkflows folder.

cwltool execution

The run-cwltool.sh script can be used to execute a workflow on a single compute instance using cwltool. Two arguments must be provided:

  • A path to your job directory
  • The main workflow file that you want to run

For example, to run the paired-end BAM workflow, you can execute the following command from the base directory:

./utils/run-cwltool.sh jobs/test-paired-bam bam_paired.cwl

toil execution

Toil is a workflow engine that can execute CWL workflows in the cloud or other compute infrastructures. We have provided a script that can be used to submit workflows on a Toil Cluster in AWS. To run the script:

  • ssh to toil cluster leader node
  • from this directory (presuming the git repo was cloned to the leader),
  • choose a job directory, for example, jobs/test-paired-bam
  • execute toil run script: ./utils/run-toil.py jobs/test-paired-bam

Run ./utils/run-toil.py -h to see more options. Note that there is a --dry-run option, which can help you to become familiar with the tool.

How to Add More Jobs

To add a new job, create a new directory under jobs.

Each job directory requires an options.json, the set of options used by toil. The options.json in jobs/default contains default options. Additional ones can be added (or overwritten) in your job directory's options.json. The run-toil.py script will warn you if any are missing.

Each job directory also requires a job.json. This contains the arguments that are supplied to the CWL that you specify in your options.json.

For examples of both options.json and job.json, see jobs/test-paired-bam.

Required Job Inputs

Each workflow requires the following inputs:

  • cwl_wf_url: A URL that points to a commit or tagged version of this github repository at the time of job submission. "https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/tree/5832931a9569d9d8fba26a36146a682870d6f5f7", for example. Guidance on generating a permanent github link can be found here.
  • cwl_args_url: A raw github URL that points to the input parameters file for the job that you are running. "https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/5832931a9569d9d8fba26a36146a682870d6f5f7/jobs/test-paired-bam/job.json", for example. To find the raw URL for a file on github, navigate to the file and follow the instructions for generating a permanent url. You can then click on the raw button to open the raw URL in your browser.
  • index_synapseid: A Synapse ID for the folder that contains a STAR-indexed reference genome. An example can be found in syn22152278. Two gtf files must be presesnt in this folder to run the mirna_single.cwl workflow: A main gtf file with the filename extension ".annotation.gtf" and a gtf file that contains only miRNA annotations with the filename extension ".subset.gtf". An example of a miRNA-compatible reference genome folder can be found in syn22342700
  • nthreads: An integer value that represents the number of compute threads that the STAR aligner should use.
  • synapse_parentid: A Synapse ID for the folder that output tables will be uploaded to.
  • synapse_config: A Synapse configuration file that will be used to authenticate data downloads and uploads during workflow execution
  • synapseid: List of Synapse ID's that correspond to input reads for processing. For the bam_paired.cwl workflow, the ID's should point to BAM files. For the fastq_paired.cwl workflow, these ID's should point to compressed fastq files for the forward reads. For the fastq_single.cwl workflow, these ID's should point to compressed fastq files. These files must contain a specimenID annotation in Synapse

The fastq_paired.cwl workflow also requires the following input:

  • synapseid_2: A list of Synapse ID's that correspond to the reverse reads in compressed fastq.gz format. These files must contain a specimenID annotation in Synapse, and this list should be ordered by specimen to match the synapesid list.

An example input json file that contains values for these required inputs can be found here

An example input json file that contains example parameters for the mirna_single.cwl workflow can be found here

Optional Job inputs

You can optionally supply an input parameter that specifies the strandedness parameter of the library that will be used by Picard Tools. To do so, add the strand_specificity argument to your job.json file. The three valid string options for this parameter are:

  • NONE
  • FIRST_READ_TRANSCRIPTION_STRAND
  • SECOND_READ_TRANSCRIPTION_STRAND

If this argument is not provided, the default value of NONE will be used.

To specify the column parse from STAR gene count output, specify the column_number parameter. The three valid integer arguments are:

  • 2 : counts for unstranded RNA libraries
  • 3: counts for first read stranded libraries
  • 4: counts for second read stranded libraries

If this argument is not provided, the default value of 2 will be used. This is the correct value for libraries that are not specifically designed to be stranded.

An example input json file that contains the required inputs and these optional inputs can be found here

In addition, you may optionally specify the following parameters for the STAR alignment (Note that it is highly recommended to customize these arguments for the mirna_single.cwl workflow):

  • alignEndsType : A string specifying the type of read ends alignment
  • outFilterMismatchNmax : Integer specifying the maximum number of mismatches per pair
  • outFilterMultimapScoreRange : Integer specifying the score range for multi-mapping alignments
  • outFilterMultimapNmax : Integer specifying the maximum number of multiple alignments for a read
  • outFilterScoreMinOverLread : Integer specifying the minimum score for an alignment to be reported, normalized to read length
  • outFilterMatchNminOverLread : Integer specifying the minimum number of matched bases for an alignment to be reported, normalized to read length
  • outFilterMatchNmin : Integer specifying the minimum number of matched bases for an alignment to be reported
  • alignSJDBoverhangMin : Integer specifying the minimum block size for annotated spliced alignments
  • alignIntronMax : Integer specifying the maximum intron size

For further details about these parameters, please refer to the STAR manual

Resource Requirements

Resource requirements are specified using the CWL ResourceRequirement class. Each subworkflow contains specific requests for RAM, disk space, and number of threads. These values are set for average-sized RNA Sequencing input files for alignment against the human reference genome. If the default values are not sufficient, please modify the ResourceRequirement values in the subworkflow CWL files.

Workflow Outputs

The following output files are uploaded to Synapse during workflow execution:

  • gene_all_counts_matrix.txt: Table containing raw gene counts, where row labels are geneid's and column labels are synapseid's for input files
  • gene_all_counts_matrix_clean.txt: The same gene count table, but synapseid's have been converted to specimenID's and any duplicate samples were removed
  • Star_Log_Merged.txt: Table containing mapping statistics that were parsed from the STAR aligner log files
  • Star_Log_Merged_clean.txt: The same mapping statistics table, but synapseid's have been converted to specimenID's and any duplicate samples were removed
  • Study_all_metrics_matrix.txt: A table containing statistics about realigned BAM files, as generated by picardtools
  • Study_all_metrics_matrix_clean.txt: The same table containing realigned BAM file statistics, but synapseid's have been converted to specimenID's and any duplicate samples were removed
  • provenance.csv: A csv file where the first field contains the synapseid's for all input files that were used in the processing workflow, the second field contains the version of the synapseid that was used, and the third field contains the corresponding specimenID from the Synapse annotation

Tests

cwltest is used for testing. Add test descriptions to tests/test-descriptions.yaml. Each test added requires a file describing the job inputs that should be added to the tests directory.

Integration Tests

Integration tests are automatically performed on any push to the master branch that does not contain the [skip-ci] string in the commit message. Test data is stored on a project in Synapse, and is accessed using a service account that has credentials stored as github secrets in this repository.

Continuous Deployment and Versioning

This repository uses GitHub actions to run tests and perform automated versioning.

CI

Defined in .github/workflows/ci.yaml, this action runs on each push to master where the commit does not contain '[skip-ci]'.

Versioning

Versioning is achieved through git tagging using semantic versioning. Each push to master will generate an increment to the patch value, unless the commit contains the string '[skip-ci]'.

Use the release script to do a minor or major release. To create a minor release, run python utils/release.py from the project root. To create a major release, run the same command but add the flag --major.

The release script has dependencies which can be installed to virtual environment using pipenv. After installing pipenv, run pipenv install to install the dependencies, and pipenv shell to activate the environment.

Alternately, to do a minor or major releases manually:

  1. Determine what the tag value will be. For example, to make a minor release from v0.1.22, the next tag would be v0.2.0.
  2. In the CWL tools, change the docker version to use that tag, and create a commit like "Update docker version in cwl tools in preparation for minor release"
  3. Run the tagging commmand: git tag v0.2.0
  4. Push the tag: git push --tags

Branch Versioning

Optionally, you can set up your repository for running the CI action on pushes to all branches, not just master. This is not the default behavior because it introduces complexity and requires that you use git in a certain way.

To set this up, in .github/workflows/ci.yaml, change master to '*' in the event filter ( on > push > branches). This will cause pushes to non-master tags to also build. They will be tagged with this pattern: -, e.g. v1.0.0-197e187.

If you choose to make this change, for best results we recommend that you also use the no-fast-forward flag (--no-ff) when merging branches to master. Using that flag will ensure that a new merge commit is created, and CI will run correctly. Without a new merge commit, versioning won't work correctly.