The small RNA-Seq pipeline

Summary
Installation
- Create a Conda environment
- Dependencies
Usage
Authors
- Contributors
- Maintainers
Citation
License
Versioning
Acknowledgments
References

Summary

The small RNA-Seq pipeline is a Snakemake pipeline that annotates small RNA loci (miRNAs, phased siRNAs) from experimental small RNA-Seq datasets, using one or more reference genomes.
It relies heavily on the ShortStack software, which annotates and quantifies small RNAs against a reference genome.

Upon completion, several outputs will be generated for each sample:

One ShortStack result file called Results.txt. See the description of this file in the ShortStack manual.
Two FASTA files: one containing the predicted hairpins and one containing the predicted mature microRNAs.
Two BLAST result files (in tabular format) from the BLAST of the predicted hairpins and mature miRNAs against miRBase (the miRBase version is specified in the config file). See the miRBase website for releases.
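
The tabular BLAST results can be summarised with a few lines of Python. The sketch below is illustrative only: it assumes the files use the default 12-column BLAST+ tabular layout (-outfmt 6), which this README does not confirm, and the example rows are made up. The function name best_hits is hypothetical and not part of helpers.py.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
# Assumption: the pipeline does not customise this column list.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query_id: hit} keeping the highest-bitscore hit per query."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        row["pident"] = float(row["pident"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration: two hits for one query, one for another.
example = io.StringIO(
    "Cluster_1\tsly-miR156a\t100.0\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
    "Cluster_1\tath-miR156b\t95.2\t21\t1\t0\t1\t21\t1\t21\t1e-04\t34.2\n"
    "Cluster_7\tsly-miR166a\t100.0\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

To use it on a real result file, pass an open file handle instead of the in-memory example.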

Installation

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Create a Conda environment

This Snakemake pipeline makes use of the conda package manager to install software and dependencies.

First, make sure you have conda installed on your system. Use Miniconda3 and follow the installation instructions.
Using conda, create a virtual environment called small that provides Snakemake (version 5.4.3 or higher is required) by executing the following command in a shell window: conda env create -f environment.yml. This installs snakemake version 5.20.0 and pandas version 0.25.0 in the new environment.
Activate this environment using: conda activate small
You can now run the pipeline (see below).

If you have set up conda and created the small environment, that's all you need to do!

Dependencies

Snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
NCBI blast+ - A program to perform sequence similarity search. See NCBI Blast webpage for more info.
ShortStack - Small RNA loci annotation and quantification.
Trimmomatic - Read trimming for NGS data.
bioawk - Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names.

A series of custom Python functions are also used and can be found in the helpers.py file.
The versions of the software and packages are listed in their respective environment YAML files in the envs/ folder.

Usage

Example

A small dataset is available in test/ to run some tests rapidly. It will use the genome and miRBase reference fasta files stored in refs/.
To run the test, open a new Shell window and:

Activate your working environment: conda activate small
Type snakemake -j 1 -np for a dry run. No analysis is run but it checks that the Directed Acyclic Graph of jobs is OK (input and output from each rule chained to each other).
For the real run, type snakemake --cores N where N is the number of CPUs that you want to use (default = 1).

Samples

A samples.tsv file can be used to specify sample names, their corresponding genomic reference to use and the location of their sequencing file.
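
As a rough illustration of how such a sample sheet can be read, the sketch below parses a tab-separated table and groups samples by genomic reference. The column names (sample, genome, fastq) and the file paths are assumptions made for the example; check the header of the samples.tsv shipped with the pipeline for the real ones.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample sheet: column names and paths are illustrative only.
EXAMPLE_TSV = (
    "sample\tgenome\tfastq\n"
    "leaf_1\trefs/genome_A.fasta\tfastq/leaf_1.fastq.gz\n"
    "leaf_2\trefs/genome_A.fasta\tfastq/leaf_2.fastq.gz\n"
    "root_1\trefs/genome_B.fasta\tfastq/root_1.fastq.gz\n"
)


def read_samples(handle):
    """Return {sample_name: {"genome": path, "fastq": path}} from a TSV handle."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"]
        if name in samples:
            raise ValueError(f"duplicated sample name: {name}")
        samples[name] = {"genome": row["genome"], "fastq": row["fastq"]}
    return samples


samples = read_samples(io.StringIO(EXAMPLE_TSV))

# Group sample names by the genomic reference they should be aligned to.
by_genome = defaultdict(list)
for name, info in samples.items():
    by_genome[info["genome"]].append(name)

for genome, names in by_genome.items():
    print(genome, names)
```

Rejecting duplicated sample names early avoids two samples silently writing to the same output location.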

Configuration

Configuration settings can be changed in the config.yaml file. For instance, one could modify the minimal coverage required by ShortStack to discover sRNA loci.
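
To make the link between a configuration value and the ShortStack call concrete, here is a minimal sketch. It is not the pipeline's actual rule: the configuration keys (shortstack, mincov), the example value and the function name are hypothetical, and the config is written as a plain Python dictionary rather than loaded from config.yaml. The command-line options shown (--readfile, --genomefile, --outdir, --mincov) are taken from the ShortStack documentation; verify them against the ShortStack version you have installed.

```python
# Hypothetical configuration, mirroring what a config.yaml section might hold.
config = {"shortstack": {"mincov": 5}}


def shortstack_command(config, readfile, genomefile, outdir):
    """Build a ShortStack argument list from a config dictionary (sketch only)."""
    cmd = [
        "ShortStack",
        "--readfile", readfile,
        "--genomefile", genomefile,
        "--outdir", outdir,
    ]
    # Only pass --mincov when the setting is present, so that ShortStack
    # falls back to its own default otherwise.
    mincov = config.get("shortstack", {}).get("mincov")
    if mincov is not None:
        cmd += ["--mincov", str(mincov)]
    return cmd


# Illustrative paths, not files that ship with the pipeline.
cmd = shortstack_command(
    config, "trimmed/leaf_1.fastq", "refs/genome_A.fasta", "shortstack/leaf_1"
)
print(" ".join(cmd))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if a path contains spaces.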

Genomic references

Different genomic references can be used for each sample. Simply provide a genomic reference corresponding to your sample.

Authors

Contributors

Marc Galland - Initial work - Github profile
Michelle van der Gragt - Initial work - Github profile

Maintainers

Marc Galland - Initial work - Github profile

Citation

...as soon as we have published this software!

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Versioning

SemVer is used for versioning. For the versions available, see the releases on this repository.

Acknowledgments

References

Bioawk tutorial: https://isugenomics.github.io/bioinformatics-workbook/Appendix/bioawk-basics
Vienna RNAfold tutorial: https://www.tbi.univie.ac.at/RNA/tutorial/#sec3
miRTop: from BAM files to GFF3 files (and conversion to other formats such as Fasta etc.): https://academic.oup.com/bioinformatics/article/36/3/698/5556118