BleekerLab/small-rna-seq-pipeline

The small RNA-Seq pipeline

Summary

The small RNA-Seq pipeline is a Snakemake pipeline that annotates small RNA loci (miRNAs, phased siRNAs) using one or more reference genomes and experimental small RNA-Seq datasets.
It relies heavily on ShortStack, which annotates and quantifies small RNAs against a reference genome.

Upon completion, several outputs will be generated for each sample:

  • One ShortStack result file called Results.txt. See the description of this file in the ShortStack manual.
  • Two FASTA files: one containing the predicted hairpins and one containing the predicted mature microRNAs.
  • Two BLAST result files (in tabular format) from searching the predicted hairpins and mature miRNAs against miRBase (the miRBase version is specified in the config file). See the miRBase website for releases.
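As an illustration of how the tabular BLAST results could be inspected afterwards, here is a minimal Python sketch. It assumes the files use the 12 standard columns of BLAST+ tabular output (-outfmt 6); the pipeline may use a different column layout, so check the rule that runs BLAST. The function name best_hits and the example rows are made up for this sketch.

```python
import csv
import io

# Assumed layout: the 12 default columns of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return, for each query sequence, the hit with the highest bit score."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        # Convert the numeric fields used for ranking and reporting.
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only (not real pipeline output).
example = (
    "Cluster_1\tath-miR156a\t100.0\t20\t0\t0\t1\t20\t1\t20\t1e-05\t40.1\n"
    "Cluster_1\tath-miR157a\t95.0\t20\t1\t0\t1\t20\t1\t20\t2e-03\t32.2\n"
    "Cluster_7\tsly-miR166b\t100.0\t21\t0\t0\t1\t21\t1\t21\t3e-06\t42.1\n"
)

hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

To read a real result file, pass an open file handle instead of the io.StringIO object.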

Installation

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Create a Conda environment

This Snakemake pipeline uses the conda package manager to install software and dependencies.

  1. First, make sure you have conda installed on your system. Use Miniconda3 and follow the installation instructions.
  2. Using conda, create a virtual environment called small that contains Snakemake (version 5.4.3 or higher) by executing the following command in a shell window: conda env create -f environment.yml. This will install Snakemake version 5.20.0 and pandas version 0.25.0 in the new environment.
  3. Activate this environment using: conda activate small
  4. You can now run the pipeline (see below).

If you have set up conda and created the small environment, that's all you need to do!

Dependencies

  • Snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
  • NCBI BLAST+ - A program to perform sequence similarity searches. See the NCBI BLAST webpage for more info.
  • ShortStack - Small RNA loci annotation and quantification.
  • Trimmomatic - Read trimming for NGS data.
  • bioawk - Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names.

A series of custom Python functions is also used and can be found in the helpers.py file.
The versions of the software and packages are listed in their respective environment YAML files in the envs/ folder.

Usage

Example

A small dataset is available in test/ to run some tests rapidly. It will use the genome and miRBase reference fasta files stored in refs/.
To run the test, open a new Shell window and:

  1. Activate your working environment: conda activate small
  2. Type snakemake -j 1 -np for a dry run. No analysis is run but it checks that the Directed Acyclic Graph of jobs is OK (input and output from each rule chained to each other).
  3. For the real run, type snakemake --cores N where N is the number of CPUs that you want to use (default = 1).
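To make the idea behind the dry run more concrete, the following toy Python sketch checks that the inputs of every rule are either existing files or outputs of another rule, and returns an order in which the rules could run. It is only an illustration of the concept, not how Snakemake is implemented, and the rule and file names are hypothetical rather than taken from the pipeline's Snakefile.

```python
# Hypothetical rules and file names, for illustration only.
rules = {
    "trim": {"input": ["reads.fastq"], "output": ["trimmed.fastq"]},
    "shortstack": {
        "input": ["trimmed.fastq", "genome.fasta"],
        "output": ["Results.txt"],
    },
    "blast": {"input": ["Results.txt", "mirbase.fasta"], "output": ["blast.tsv"]},
}
existing_files = {"reads.fastq", "genome.fasta", "mirbase.fasta"}


def plan(rules, existing_files):
    """Return rule names in a runnable order, or raise if some input can never be produced."""
    available = set(existing_files)
    pending = dict(rules)
    order = []
    while pending:
        # A rule is ready when all of its inputs already exist or have been planned.
        ready = [
            name
            for name, rule in pending.items()
            if all(path in available for path in rule["input"])
        ]
        if not ready:
            raise ValueError(f"unsatisfiable inputs for rules: {sorted(pending)}")
        for name in ready:
            available.update(pending.pop(name)["output"])
            order.append(name)
    return order


order = plan(rules, existing_files)
print(order)
```

If a required input is missing (for example, if genome.fasta is removed from existing_files), plan raises a ValueError, which is analogous to a dry run failing before any job is started.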

Samples

A samples.tsv file specifies the sample names, the genomic reference to use for each sample and the location of each sample's sequencing file.
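The sketch below shows one way such a sample sheet could be read and validated in Python. The column names (sample, genome, fastq) and all paths are assumptions made for this example; check the samples.tsv shipped with the repository for the actual layout.

```python
import csv
import io

# Hypothetical sample sheet: column names and paths are assumptions.
example_samples_tsv = (
    "sample\tgenome\tfastq\n"
    "leaf_rep1\trefs/genome_A.fasta\ttest/leaf_rep1.fastq.gz\n"
    "root_rep1\trefs/genome_B.fasta\ttest/root_rep1.fastq.gz\n"
)

REQUIRED_COLUMNS = {"sample", "genome", "fastq"}


def load_samples(handle):
    """Map each sample name to its genomic reference and sequencing file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    samples = {}
    for row in reader:
        if row["sample"] in samples:
            raise ValueError(f"duplicate sample name: {row['sample']}")
        samples[row["sample"]] = {"genome": row["genome"], "fastq": row["fastq"]}
    return samples


samples = load_samples(io.StringIO(example_samples_tsv))
print(samples["leaf_rep1"]["genome"])
```

Each sample carries its own genome entry, so different samples can point to different reference files.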

Configuration

Configuration settings can be changed in the config.yaml file. For instance, one could modify the minimal coverage required by ShortStack to discover sRNA loci.
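As a rough sketch of how a configuration value might end up on the ShortStack command line, consider the Python snippet below. The configuration key names (shortstack, mincov), the value 5 and the file paths are hypothetical. The flags --readfile, --genomefile, --outdir and --mincov are used here as I understand them from the ShortStack manual; verify them against the manual of the ShortStack version you have installed.

```python
# Hypothetical configuration; in the pipeline these values would come from config.yaml.
config = {"shortstack": {"mincov": 5}}  # key names and value are assumptions


def shortstack_command(readfile, genomefile, outdir, config):
    """Build a ShortStack command as a list of arguments (nothing is executed)."""
    cmd = [
        "ShortStack",
        "--readfile", readfile,
        "--genomefile", genomefile,
        "--outdir", outdir,
    ]
    # Only pass the minimal coverage when it is set in the configuration.
    mincov = config.get("shortstack", {}).get("mincov")
    if mincov is not None:
        cmd += ["--mincov", str(mincov)]
    return cmd


cmd = shortstack_command(
    "trimmed/leaf_rep1.fastq",   # hypothetical path
    "refs/genome_A.fasta",       # hypothetical path
    "shortstack/leaf_rep1",      # hypothetical path
    config,
)
print(" ".join(cmd))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if the command is later passed to subprocess.run.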

Genomic references

A different genomic reference can be used for each sample. Specify the reference that corresponds to each sample in the samples.tsv file.

Authors

Contributors

Maintainers

Citation

...as soon as we have published this software!

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Versioning

SemVer is used for versioning. For the versions available, see the releases on this repository.

Acknowledgments

References
