Genotyping by sequencing (DNA- or RNA-seq data)

Description
Installation and usage (local machine)
Installation and usage (HPC cluster)
Directed Acyclic Graph of jobs
References 📗
Citation

Description

A Snakemake pipeline that calls SNPs (Single Nucleotide Polymorphisms) from either DNA-seq or mRNA-seq data. It trims and aligns DNA/mRNA-seq fastq to a reference genome, then call SNPs before computing the number of SNPs per genomic window (e.g. per 1Mb). This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.

Steps

The raw fastq files will be trimmed for adaptors and quality checked with fastp.
The genome sequence FASTA file will be used for the mapping step of the trimmed reads using either bwa or STAR.
The reference genome will be sliced into bins of a pre-determined size (e.g. 1Mb).
SNPs are called based on the alignment .bam files generating one VCF file per sample.
SNPs are filtered based on a read depth threshold and SNP quality.
The obtained filtered VCF file(s) are converted to the BED format .bed to allow computation.
The number of overlapping SNPs per genome bin is computed using BEDOPS bedmap operator.

Input files

DNA or RNA-seq fastq files as listed in the config/samples.tsv file. Specify a sample name (e.g. "Sample_A") in the sample column and the paths to the forward read (fq1) and to the reverse read (fq2). If you have single-end reads, leave the fq2 column empty.
A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
(for RNA-seq datasets) A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: gffread my.gff3 -T -o my.gtf. ⚠️ for featureCounts to work, the feature in the GTF file should be exon while the meta-feature has to be transcript_id. This will be converted to a BED file to compute the number of genes per genomic bin.

Below is an example of a GTF file format. ⚠️ a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.

seqname	source	feature	start	end	score	strand	frame	attributes
SL4.0ch01	maker_ITAG	CDS	279	743	.	+	0	transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1";
SL4.0ch01	maker_ITAG	exon	1173	1616	.	+	.	transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";
SL4.0ch01	maker_ITAG	exon	3793	3971	.	+	.	transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";

Output files

One .tsv file per sample summarising the number of counts per genome bin called [sample name].counts.tsv. This table is used for plotting.
fastp QC reports: one per fastq file.

Prerequisites: what you should know before using this pipeline

Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix begginer guide available here) should be sufficient for that (check the wget and scp commands).
An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.

Content of this GitHub repository

Snakefile: a master file that contains the desired outputs and the rules to generate them from the input files.
config/samples.tsv: a file containing sample names and the paths to the forward and eventually reverse reads (if paired-end). This file has to be adapted to your sample names before running the pipeline.
config/config.yaml: the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules.
config/refs/: a folder containing
- a genomic reference in fasta format. The S_lycopersicum_chromosomes.4.00.chrom1.fa is placed for testing purposes.
- a GTF annotation file. The ITAG4.0_gene_models.sub.gtf for testing purposes.
The environment.yaml is used by the conda package manager to create a working environment (see below).

You have to create the config/fastq/ folder that should contain test fastq files. These test fastq files are subsetted paired-end fastq files used to test locally the pipeline. They are generated using Seqtk:seqtk sample -s100 <inputfile> 250000 > <output file>. Files can be downloaded from Zenodo.

Installation and usage (local machine)

Installation

You will need a local copy of the GitHub genotype_by_sequencing repository on your machine. You can either:

use git in the shell: git clone git@github.com:BleekerLab/genotype_by_sequencing.git.
click on "Clone or download" and select download.
Then navigate inside the genotype_by_sequencing folder using Shell commands.

Usage

Configuration ✏️

You'll need to change a few things to accomodate this pipeline to your needs.

Make sure you have changed the parameters in the config/config.yaml file that specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use and the parameters for certains rules etc. in particular, indicate whether your data are DNA-seq or RNA-seq data
The type of input data has to be either DNA or RNA in the datatype section of the config.yaml.

This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.

📍 conda

Using the conda package manager, you need to create an environment where core softwares such as Snakemake will be installed.

Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
Inside a Shell window (command line interface), create a virtual environment named gbs using the environment.yaml file with the following command: conda install -c conda-forge mamba --yes && mamba env create --file environment.yaml. The mamba package manager is faster than conda.
Then, before you run the Snakemake pipeline, activate this virtual environment with conda activate gbs.

Dry run

Use the snakemake -np to perform a dry run that prints out the rules and commands.

Real run

With conda: snakemake --cores 10

Installation and usage (HPC cluster)

Installation

You will need a local copy of the GitHub genotype_by_sequencing repository on your machine. On a HPC system, you will have to clone it using the Shell command-line: git clone git@github.com:BleekerLab/genotype_by_sequencing.git.

click on "Clone or download" and select download.
Then navigate inside the genotype_by_sequencing folder using Shell commands.

SLURM sbatch usage

See the detailed protocol here.

Here is an example script to be saved as my_run.sh and executed with SLURM as sbatch my_run.sh
This will submit the batch script to SLURM sbatch.

#!/bin/bash
#
#SBATCH --job-name=snakemake_gbs        # job name
#SBATCH  --time=24:00:00                # mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem-per-cpu=8G                # RAM requested per job (in Snakemake, one rule = one job)
#SBATCH --output=parallel_%j.log        # standard output and error log (%j substitutes the JOB ID)


#SBATCH --nodes=1                       # Run all processes on a single node	
#SBATCH --cpus-per-task=30              # Number of CPUs per task

source activate genotype

srun snakemake -j $SLURM_CPUS_PER_TASK

SLURM interactive usage

To start an interactive session, do:

Step1: srun --time=24:00:00 --mem-per-cpu=8G --cpus-per-task=10 --pty bash -i
Step2: conda activate gbs
Step3: snakemake -j 10 (since we specified 10 cpus per task)
Step4: exit

This starts an interactive bash with 10 CPUs allocated per task and 8G of RAM per CPU.

Retrieving your results

For instance, download the results but exclude bam files (too big).
rsync -a -v -e ssh --exclude="*bam" mgallan1@omics-h0.science.uva.nl:/zfs/omics/personal/mgallan1/workspace/genotype_by_sequencing/results/ [local directory]

Useful links

Directed Acyclic Graph of jobs

References 📗

Authors

Marc Galland, m.galland@uva.nl
Tijs Bliek, m.bliek@uva.nl

Pipeline dependencies

Acknowledgments 👏

Johannes Köster; creator of Snakemake.

Citation

If you use this software, please use the following citation:

Bliek T. and Galland M. (2021). Genotyping by sequencing pipeline (version 0.1.0).

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
config		config
scripts		scripts
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
LICENSE.md		LICENSE.md
README.md		README.md
Snakefile		Snakefile
dag.pdf		dag.pdf
environment.yaml		environment.yaml
sbatch_submission_script.sh		sbatch_submission_script.sh

License

Licenses found

BleekerLab/genotype_by_sequencing

Folders and files

Latest commit

History

Repository files navigation

Genotyping by sequencing (DNA- or RNA-seq data)

Description

Steps

Input files

Output files

Prerequisites: what you should know before using this pipeline

Content of this GitHub repository

Installation and usage (local machine)

Installation

Usage

Configuration ✏️

📍 conda

Dry run

Real run

Installation and usage (HPC cluster)

Installation

SLURM sbatch usage

SLURM interactive usage

Retrieving your results

Useful links

Directed Acyclic Graph of jobs

References 📗

Authors

Pipeline dependencies

Acknowledgments 👏

Citation

About

Resources

License

Licenses found

Stars

Watchers

Forks

Languages