
hchauvin/exploratory-pipeline-example


Bioinformatics workflows with Spark and Reflow: an example

CircleCI scala: 2.12 spark: 3.0 License: MIT

This project showcases a simple bioinformatics workflow. The end product is a heatmap of normalized gene expressions across related samples. The workflow relies heavily on Spark and Reflow. The approach developed here is suitable for both small sample series and large-scale meta-analyses.

[Figure: heatmap of normalized gene expressions across related samples]

The workflow starts from a series of RNA-seq reads fetched from the Sequence Read Archive (SRA, NCBI). The reads are aligned to a reference genome from Ensembl using hisat2, the alignments are sorted and indexed using samtools, and the sorted alignments are assembled into potential transcripts using stringtie. No quality control is performed. The gene abundances reported by stringtie are normalized using the quantile method of the limma Bioconductor package, and the heatmap is produced with ComplexHeatmap.
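As a rough illustration of the alignment, sorting, indexing, and assembly steps described above, the sketch below builds (but does not run) the command lines for a single sample. It is not taken from this repository: `SamplePaths`, `AlignmentCommands`, the file naming scheme, and the choice of single-end reads (`-U`) are all assumptions made for the example. The flags shown (`hisat2 --dta -x -U -S`, `samtools sort -o`, `samtools index`, `stringtie -G -o -A`) are standard options of those tools, but the actual workflow may invoke them differently.

```scala
// Hypothetical container for the per-sample inputs; not part of this repository.
final case class SamplePaths(
    readsFastq: String,    // reads fetched from SRA, already converted to FASTQ
    hisat2Index: String,   // basename of a hisat2 index built from the Ensembl genome
    annotationGtf: String, // reference annotation used to guide stringtie
    workDir: String        // directory receiving intermediate and output files
)

object AlignmentCommands {

  /** Returns the four command lines, in execution order, for one sample.
    * Each command is a Seq of arguments so that no shell quoting is needed.
    */
  def forSample(sampleId: String, p: SamplePaths): Seq[Seq[String]] = {
    val sam   = s"${p.workDir}/$sampleId.sam"
    val bam   = s"${p.workDir}/$sampleId.sorted.bam"
    val gtf   = s"${p.workDir}/$sampleId.gtf"
    val abund = s"${p.workDir}/$sampleId.gene_abund.tab"

    Seq(
      // Align reads; --dta tailors the output for downstream transcript assemblers.
      Seq("hisat2", "--dta", "-x", p.hisat2Index, "-U", p.readsFastq, "-S", sam),
      // Sort alignments by coordinate and write a BAM file.
      Seq("samtools", "sort", "-o", bam, sam),
      // Index the sorted BAM file.
      Seq("samtools", "index", bam),
      // Assemble transcripts; -A writes a per-gene abundance table.
      Seq("stringtie", bam, "-G", p.annotationGtf, "-o", gtf, "-A", abund)
    )
  }
}
```

With `import scala.sys.process._`, each returned `Seq[String]` can be executed with `.!`, which yields the exit code of the process. Keeping command construction separate from execution makes the step order easy to inspect and test without the tools installed.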

Overall, the workflow is far from authoritative: there are many other ways to approach feature counting, and it should be viewed only as an example of how to conduct exploratory bioinformatics analysis.

Spark is used for computations on dataframes that can be massive. Scala is used for its type system and for the JVM ecosystem, which is particularly well suited to network calls and to interacting with third-party APIs (such as the NCBI or Ensembl APIs). R is used for its rich ecosystem of statistical techniques. Reflow is used to coordinate interrelated jobs on AWS S3. The whole analysis, from feature counting to producing a heatmap of normalized gene expressions, lives in a single Polynote notebook, ./RnaSeqExample.ipynb.
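To make the quantile normalization step more concrete, here is a minimal, self-contained Scala sketch of the basic idea: every sample (column) is forced onto the same reference distribution, obtained by averaging the sorted values across samples rank by rank. This is only an illustration. The workflow itself delegates this step to limma in R, and the sketch assumes a rectangular matrix with no missing values and does not average tied values the way a full implementation would. The `QuantileNormalization` object and its `normalize` method are names chosen for this example.

```scala
object QuantileNormalization {

  /** Quantile-normalizes a genes x samples matrix.
    * matrix(i)(j) is the abundance of gene i in sample j.
    * Assumptions: rectangular input, no missing values, ties broken by row order.
    */
  def normalize(matrix: Vector[Vector[Double]]): Vector[Vector[Double]] = {
    if (matrix.isEmpty) return matrix
    val nGenes   = matrix.length
    val nSamples = matrix.head.length
    require(matrix.forall(_.length == nSamples), "matrix must be rectangular")

    // One vector of values per sample.
    val columns = (0 until nSamples).map(j => matrix.map(row => row(j)))

    // For each sample, the gene indices ordered by increasing abundance.
    val orders = columns.map(col => col.indices.sortBy(i => col(i)))

    // Reference distribution: mean across samples of the value at each rank.
    val reference = (0 until nGenes).map { rank =>
      (0 until nSamples).map(j => columns(j)(orders(j)(rank))).sum / nSamples
    }

    // Give each gene the reference value that corresponds to its rank in its sample.
    val result = Array.ofDim[Double](nGenes, nSamples)
    for (j <- 0 until nSamples; rank <- 0 until nGenes)
      result(orders(j)(rank))(j) = reference(rank)

    result.map(_.toVector).toVector
  }
}
```

After normalization, each column contains the same set of values; only their assignment to genes differs, which is what makes expression levels comparable across samples in the heatmap.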

Development

Unit and integration tests:

sbt test

Code formatting:

sbt scalafmtAll

License

exploratory-pipeline-example is licensed under The MIT License.
