CardioPipeLine

This is a simple utility to analyse RNA-seq data with a selection of popular tools. It is written in snakemake that allows the user to scale it to any system, including clusters and cloud computing architectures. Currently, it supports analyses with the following tools;

STAR (genome-based mapping)
Salmon (transcript-based mapping)
Kallisto (transcript-based mapping)
SortMeRNA (rRNA filtering)
StringTie (quantification of BAM files)
NCBI-tools (downloading and conversion from NCBI server)
DESeq2 (differential expression)
MultiQC (quality control report)
FastQC (quality control for reads)
Trinity (de-novo transcript assembly)
bowtie2 (de-novo alignment)

In short, given a public run ID from the SRA archives (SRRXXXXXXX), the pipeline can download the raw data from the NCBI server, convert the files to fastq format, filter the reads, generate indices, map to genome/transcript, and quantify the mapped reads. In addition, it can also carry out differential expression analysis.

However, the user can provide their own FASTA or SRA files for the analysis. In order to modify the default options, refer to the appropriate rules in the respective rules/filename.smk files. Please read wiki for more details.

Dependencies

python3+ (tested on Python3.7 and 3.8)
snakemake (workflow management)
pandas
R 4+ (for DESeq2 analysis; tested on R 4.0.2)
tximport, DESeq2, AnnotationDbi, Rsubread (for R related analyses)

All the above-mentioned dependencies have been successfully tested on Ubuntu 18.04.5 , Gentoo 5.4.48, Fedora 32.

Installation

For installation purposes, it is a good idea to make a new python virtual environment and install snakemake on it, and then download all the tool binaries in your desired location. Following this, retrieve the repository using git;

git clone https://github.com/rohitsuratekar/CardioPipeLine.git

Alternatively, download the repository in the desired location.

The only other modification that needs to be done is to edit the config/config.yaml and config/samples.csv files to specify the location of files on your system, and then run following in the repository folder

snakemake --core all

The above mentioned command will essentially carry out all the steps for you, from downloading the file to quantification, while you sit back and relax :)

Tasks

This pipeline carries out tasks. However, you can always select which tasks you want :)

Task No	Details	Tool Used	Input ^#	Output^{#, *}
0	All tasks	--	--	--
1	Download `.sra` file	`prefetch`	samples.csv	srr.sra
2	Convert to `fastq`	`fasterq-dump`	srr.sra	srr.fastq
3	rRNA filtering	`sortmerna`	srr.fastq	srr.filtered.fastq
4	STAR indexing	`star`	genome.fa, anno.gtf	star-idx
5	STAR mapping	`star`	star-idx, srr.filtered.fastq / srr.fastq	srr.bam
6	Quantification	`stringtie`	srr.bam, anno.gtf	st.tsv
7	Salmon indexing	`salmon`	anno.gtf	salmon-idx
8	Salmon mapping	`salmon`	salmon-idx, srr.filtered.fastq / srr.fastq	quants.sf
9	Kallisto indexing	`kallisto`	anno.gtf	kallisto-idx
10	Kallisto mapping	`kallisto`	kallisto-idx, srr.filtered.fastq / srr.fastq	abundance.tsv
11	Count Matrix StringTie	`prepDE.py`	st.tsv	stringtie.counts
12	Count Matrix Star	`Rsubread`	ssr.bam	star.counts
13	Count Matrix Salmon	`tximport`	quants.sf	salmon.counts
14	Count Matrix Kallisto	`tximport`	abundance.tsv	kallisto.counts
15	Differential Analysis	`DESeq2`	*.counts	con1_vs_con2.csv
16	Quality Control Report	`fastqc`	--	srr.html
17	Quality Control Report	`multiqc`	--	quality_report.html
18	de-novo Assembly	`trinity`	srr.fastq	trinity.fasta
19	de-novo Alignment	`bowtie2`	trinity.fasta	stat.txt

^{* There might be many other related outputs.} ^{# Names are for representative purpose. Actual names will be different}

Selection of tasks

You have full control of which tasks to run. Your can run desired tasks by editing tasks parameter in the config.yaml file. If you want to perform specific task, you need to provide its input file in appropriate location.

But do not worry, thanks to snakemake, if you did not provide appropriate input files, it will search far task which outputs these files and will automatically run for you.

By default tasks: 0 is set. This will run all available tasks.

Setting up the workflow

Editing config.yaml and samples.csv is probably the most important aspect of this pipeline. These are files which will change for each user according to their system and work.

config.yaml

If you are fine with default options of tools used in this workflow, you can just edit this file and run the pipeline with snakemake. Just read the comment in config/config.yaml file and you will know what to do :) Comments are self explanatory.

Config file has parameter called base which provides base folder for all the analysis. Every output file crated with this pipeline can be found in this base folder. Overall output file structure is shown below. More detailed file structure can be found here.

base
├── sra
├── fastq
├── filtered
├── bams
├── index
├── mappings
├── deseq2
└── logs

samples.csv

This file will contain names of all the samples which you want to analyse with this pipeline. You should provide SRA Runs IDs (usually in SRRXXXXXXX format ). You can find these IDs for publically available datasets from GEO Dataset website. If you have your own fastq files, you can check naming convention and rename those files and keep in appropriate folder. and start the pipeline.

This .csv (comma delimited) file SHOULD have header row and SHOULD contains at least following headers.

run : Run ID
is_paired : true (if you are handling paired end sample), false (if it is single end sample)

Example file content

run,is_paired
SRR0000001,true
SRR0000002,false

If you are performing any DESeq2 analysis, third column specifying conditions is mandatory

run,is_paired,condition
SRR0000001,true,wt
SRR0000002,false,mt

Here third column condition will be used in DESeq2 analysis. You should update appropriate parameters in config.yaml file. For example, Let us assume following sample file

run,is_paired,time
SRR0000001,true,48
SRR0000002,false,24

In above file, as third column name is time and 24 is your reference condition then you should update following in the config.yaml

deseq2:
    design_column: "time"
    reference: "24"

In addition, you should also update design parameter from config.yaml file. For example, in above situation, it can be design: "~ time"

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
config		config
models		models
rules		rules
scripts		scripts
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

config

config

models

models

rules

rules

scripts

scripts

.gitignore

.gitignore

README.md

README.md

Snakefile

Snakefile

main.py

main.py

Repository files navigation

CardioPipeLine

Dependencies

Installation

Tasks

Selection of tasks

Setting up the workflow

config.yaml

samples.csv

About

Languages

rohitsuratekar/CardioPipeLine

Folders and files

Latest commit

History

Repository files navigation

CardioPipeLine

Dependencies

Installation

Tasks

Selection of tasks

Setting up the workflow

config.yaml

samples.csv

About

Topics

Resources

Stars

Watchers

Forks

Languages