Bioinformatics workflows with Spark and Reflow: an example

This project showcases a simple bioinformatics workflow. The end product is a heatmap of normalized gene expression across related samples. The workflow relies heavily on Spark and Reflow. The approach developed here is suitable for both small sample series and large-scale meta-analyses.

The workflow starts from a series of RNA-seq reads fetched from the Sequence Read Archive (SRA, NCBI). The reads are aligned to a reference genome from Ensembl using hisat2, the alignments are sorted and indexed using samtools, and the sorted alignments are assembled into potential transcripts using stringtie. No quality control is performed. The gene abundances reported by stringtie are normalized using the quantile method of the limma Bioconductor package, and the heatmap is produced with ComplexHeatmap.
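To make the quantile normalization step more concrete, here is a minimal sketch in plain Python. It is not the code used by this project, which delegates normalization to limma in R. The function name `quantile_normalize` and the sample matrix are hypothetical, and ties are broken by row order here, which may differ from how limma treats tied values.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix (list of rows).

    Illustrative sketch only: every sample (column) is forced to share
    the same distribution, namely the mean of the sorted columns.
    Ties are broken by row order (an assumption of this sketch).
    """
    if not matrix:
        return []
    n_genes = len(matrix)
    n_samples = len(matrix[0])

    # Extract each sample as a column.
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean across samples at each rank.
    sorted_columns = [sorted(col) for col in columns]
    reference = [
        sum(col[rank] for col in sorted_columns) / n_samples
        for rank in range(n_genes)
    ]

    # Replace each value with the reference value at its rank.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene in enumerate(order):
            normalized[gene][j] = reference[rank]
    return normalized


# Hypothetical abundances: 4 genes (rows) x 3 samples (columns).
abundances = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(abundances)
```

After normalization, every sample contains the same set of values, and only the assignment of those values to genes differs between samples. This is what makes expression levels comparable across samples in the final heatmap.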

Overall, the workflow is far from authoritative: there are many other ways to approach feature counting, and it should be viewed only as an example of how to conduct exploratory bioinformatics analysis.

Spark is used for computations on dataframes that can be massive. Scala is used for its type system and for the JVM ecosystem, which is particularly well suited to network calls and to interacting with third-party APIs (such as the NCBI or Ensembl APIs). R is used for its rich ecosystem of statistical techniques. Reflow is used to coordinate interrelated jobs on AWS S3. The whole analysis, from feature counting to producing a heatmap of normalized gene expression, lives in a single Polynote notebook, ./RnaSeqExample.ipynb.

Development

Unit and integration tests:

sbt test

Code formatting:

sbt scalafmtAll

License

exploratory-pipeline-example is licensed under The MIT License.
