human_genomics_pipeline

A simple Snakemake workflow to process paired-end sequencing data (WGS or WES) using bwa, sambamba and GATK4.

human_genomics_pipeline

workflow diagram

How to run human_genomics_pipeline

Prerequisite software: Git, Conda 4.8.2, wget, gunzip, bwa
OS: Validated on Ubuntu 16.04

This pipeline is in ongoing development. It is currently operational for single samples of whole exome sequencing data (not cohort data yet). It should theoretically be operational for whole genome sequencing data, but this has not yet been validated. It can be run against either the GRCh37 or GRCh38 reference genome. Benchmarking is currently underway.

1. Fork the pipeline repo to a personal or lab account

See here for help

2. Take the pipeline to the data on your local machine

Clone the forked human_genomics_pipeline repo into the same directory as your paired end fastq data to be processed. Required folder structure and file naming convention:

.
|___fastq/
|     |___sample1_R1
|     |___sample1_R2
|     |___sample2_R1
|     |___sample2_R2
|     |___ ...
|
|___human_genomics_pipeline/

See here for help

3. Create a local copy of the reference human genome and variation databases (either GRCh37 or GRCh38)

GRCh37

Download the reference human genome (GRCh37) and it's associated fasta sequence dictionary file (.dict) and fasta index file (.fai) files from the GATK resource bundle

wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz
gunzip ucsc.hg19.fasta.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.dict.gz
gunzip ucsc.hg19.dict.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.fai.gz
gunzip ucsc.hg19.fasta.fai.gz

Create index files for the genome sequence (.amb, .ann, .bwt, .pac, .sa)

bwa index -a bwtsw ucsc.hg19.fasta

Download dbSNP (build 151) from NCBI

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz.tbi

Download other variation databases from the GATK resource bundle

wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.idx.gz

GRCh38

Download the reference human genome (GRCh38) and it's associated fasta sequence dictionary file (.dict) and fasta index file (.fai) files from the GATK resource bundle

wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz
gunzip Homo_sapiens_assembly38.fasta.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.dict
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.fai

Create index files for the genome sequence (.amb, .ann, .bwt, .pac, .sa)

bwa index -a bwtsw Homo_sapiens_assembly38.fasta

Download dbSNP (build 151) from NCBI

wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/GATK/All_20180418.vcf.gz
wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/GATK/All_20180418.vcf.gz.tbi

Download other variation databases from the GATK resource bundle

wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

4. Modify the configuration file

Either edit the 'config_template.yaml' file from scratch, or edit a config file 'config_examples'.

Specify whether you are running your analysis against the GRCh37 or GRCh38 build of the reference genome and whether the data is to be analysed on it's own ('Single') or as a part of a cohort ('Cohort'). For example:

# Type of input data (either 'Single' or 'Cohort')
DATA: "Single"

Set the the working directories in the config file to the reference human genome file (GRCh37 or GRCh38), dbSNP database file (GRCh37 or GRCh38) and a temporary directory. For example:

# File directories to reference genome and dbSNP database
REFGENOME: "/home/lkemp/publicData/referenceGenome/Homo_sapiens_assembly38.fasta"
dbSNP: "/home/lkemp/publicData/dbSNP/All_20180418.vcf.gz"

# Temporary file directory
TEMPDIR: "/home/lkemp/tmp/"

If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see here for more information on accessing design files). Also set the level of padding. For example:

If NOT analysing WES data, leave these fields blank

# WES settings (leave blank if analysing other data such as WGS)
WES:
  # Genomic intervals over which to operate
  INTERVALS: "-L /home/lkemp/publicData/sure_select_human_all_exon_V7/S31285117_AllTracks.bed"
  # Padding (in bp) to add to each interval
  PADDING: "-ip 100"

Pass the resources to be used to recalibrate bases with gatk BaseRecalibrator to the --known-sites flag. For example:

RECALIBRATION:
  RESOURCES: "--known-sites /home/lkemp/publicData/dbSNP/All_20180418.vcf.gz
              --known-sites /home/lkemp/publicData/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"

Save your modified config file with a descriptive name

5. Configure to run on a HPC (optional)

In theory, this cluster configuration should be adaptable to other job scheduler systems, but here I will demonstrate how to integrate this pipeline with slurm.

Configure account:, partition: and nodelist: in default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation here and here). For example:

{
    "__default__" :
    {
        "account" : "lkemp",
        "nodes" : 1,
        "ntasks" : 4,
        "partition" : "prod",
        "nodelist" : "kscprod-bio4"
    }
}

This is a good place to go for a good working example if you get stuck.

These variables will need to be passed to snakemake in the snakemake run script (see example in step 6).

6. Modify the run scripts

Specify your config file to be used with the --configfile flag and modify the number of cores to be used with the -j flag. For example:

Dry run (dryrun.sh):

snakemake \
-n -j 32 \
--use-conda \
--configfile config_37_single_WES.yaml

Full run (run.sh):

snakemake \
-j 32 \
--use-conda \
--configfile config_37_single_WES.yaml

Report (report.sh)

snakemake \
--report ESR_report.html \
--configfile config_37_single_WES.yaml \
--report-stylesheet ESR_stylesheet.css

See the snakemake documentation for additional run parameters.

HPC

If you want to run the pipeline on a HPC, pass the cluster variables set in 'cluster.json' to the dry run and full run scripts. For example:

Dry run (dryrun.sh):

snakemake \
-n -j 32 \
--use-conda \
--configfile config_37_single_WES.yaml \
--cluster-config cluster.json \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
--nodes {cluster.nodes} \
--ntasks {cluster.ntasks} \
--nodelist {cluster.nodelist}"

Full run (run.sh):

snakemake \
-j 32 \
--use-conda \
--configfile config_37_single_WES.yaml \
--cluster-config cluster.json \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
--nodes {cluster.nodes} \
--ntasks {cluster.ntasks} \
--nodelist {cluster.nodelist}"

7. Create and activate a conda environment with python and snakemake installed

conda env create -f pipeline_run_env.yml
conda activate pipeline_run_env

8. Run the pipeline

First carry out a dry run

bash dryrun.sh

If there are no issues, start a full run

bash run.sh

9. Evaluate the pipeline run

Generate an interactive html report

bash report.sh

10. Commit and push to your forked version of the repo

To maintain reproducibility, commit and push:

All configuration files
All run scripts
The final report

11. Repeat step 10 each time you re-run the analysis with different parameters

12. Create a pull request with the upstream repo to merge any useful changes (optional)

Contributions and feedback are more than welcome! 😊

See here for help

Useful reading

Van der Auwera et al., (2013). Current Protocols in Bioinformatics. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline; 11(1110): 11.10.1–11.10.33.

Name		Name	Last commit message	Last commit date
Latest commit History 161 Commits
config_examples		config_examples
envs		envs
report		report
rules		rules
.gitignore		.gitignore
ESR_stylesheet.css		ESR_stylesheet.css
README.md		README.md
Snakefile		Snakefile
cluster.json		cluster.json
config_template.yaml		config_template.yaml
dryrun.sh		dryrun.sh
pipeline_run_env.yml		pipeline_run_env.yml
report.sh		report.sh
rulegraph.png		rulegraph.png
run.sh		run.sh

MerrimanLab/human_genomics_pipeline

Folders and files

Latest commit

History

Repository files navigation

human_genomics_pipeline

workflow diagram

How to run human_genomics_pipeline

1. Fork the pipeline repo to a personal or lab account

2. Take the pipeline to the data on your local machine

3. Create a local copy of the reference human genome and variation databases (either GRCh37 or GRCh38)

GRCh37

GRCh38

4. Modify the configuration file

5. Configure to run on a HPC (optional)

6. Modify the run scripts

HPC

7. Create and activate a conda environment with python and snakemake installed

8. Run the pipeline

9. Evaluate the pipeline run

10. Commit and push to your forked version of the repo

11. Repeat step 10 each time you re-run the analysis with different parameters

12. Create a pull request with the upstream repo to merge any useful changes (optional)

Useful reading

About

Resources

Stars

Watchers

Forks

Languages