ENCODE WGBS WORKFLOW

End to end implementation of the ENCODE whole genome bisulfite sequencing workflow described here. Builds off DNANexus WGBS framework and work from the Myers lab at the HudsonAlpha Institute for Biotechnology. Developed in the Zaitlen Lab, Univeristy California at San Francisco.

Overview

Takes raw whole-genome bisulfite sequencing data (fastq files) and returns the number of methylated CpGs found at each site using the Bismark software package (documentation for this can be found here). Performs preliminary methylation analysis based on MethylKit.

Software Requirements

Needed to successfuly run bash script implementing workflow

Bowtie 2 read aligner:
- Bowtie 2 - used to align raw fastq reads to a genome of choice
Bismark:
- Bismark - a software tool developed at the Babraham Institute used to map bisulfite converted reads and determine cytosine methylation states
Bisulfite converted genome:
- Raw fasta files Needed by Bowtie to map reads. A few steps must be taken to prepare the genome file for use in Bowtie/Bismark.
  1. First, the genome must be built/indexed for Bowtie. This must be done just once before running this pipeline for the first time. This built version can be used for all subsequent runs. Documentation from the next-gen analysis software Homer provides an excellent tutorial for how to prepare the genome for bowtie.
    - Illumina iGenome provides several pre indexed/built genomes for popular model organisms.
  2. Next, the genome must be bisulfite converted for use in Bismark. The Bismark documentation provides guidance on this under the bismark_genome_preparation section. This step will probably take several hours.
Trim Galore:
- Trim Galore adapter trimmer - applies quality filters and adapter trimming to raw fastq files. Also developed at the Babraham Institute. Depends on a bunch of software itself
  1. Cutadapt - removes unwanted sequences from fastq files
  2. Fastqc - provides quality information on raw fastq files. Useful for generating visualizations.
Samtools:
- Samtools - needed for manipulation of bam/sam files generated by Bismark.

To Run

This pipeline is a Sun Grid Engine implementation, meaning that it uses a qsub_submit.sh script to schedule jobs and manage memory. This script assumes you have paired ended sequencing. Slight modifications can be made to make it usable for single ended sequencing.

qsub_submit.sh is the only script that needs to be modified to run the pipeline. The following variables need to be updated in order to run the pipeline:

QSUB parameters

-t: the number of samples to be processed
-l: the time to run the script for (default is maximum time allowed)
netapp=2G,scratch=2G: locations of scratch directories, depends on your system setup

Software Environment

GENOME_PATH: the location of your downloaded genome
BISMARK_PATH: location of the bismark installation
SAMTOOLS_PATH: location of samtools installation
TRIMGALORE_PATH: location of trimgalore installation
BOWTIE_PATH: location of bowtie installation

Location of input/output directories

input: location of fastq files to be processed
output: location to write output files to
fastq_prefix: how to reference and name your fastq files. In the example script, an SGE TASK ID is used as the prefix to enable easy parallelization. Script finds all fastq files in input location that satisfy input/prefix.fastq.gz
scripts: location of bismark shell scripts contained in this repository

Sample qsub submit script

################ qsub parameters ###########################
#$ -S /bin/bash                     
#$ -cwd                             
#$ -r y                            
#$ -j y                           
#$ -l mem_free=15G                 
#$ -l arch=linux-x64               
#$ -l netapp=2G,scratch=2G         
#$ -l h_rt=336:00:00
#$ -t 9:10  

################## environment ###############################
GENOME_PATH="/ye/zaitlenlabstore/christacaggiano/hg38/"
BISMARK_PATH="/ye/zaitlenlabstore/christacaggiano/Bismark"
SAMTOOLS_PATH="/ye/netapp/jimmie.ye/tools/samtools-1.3"
TRIMGALORE_PATH="/ye/zaitlenlabstore/christacaggiano/trim-galore/TrimGalore-0.4.3"
BOWTIE_PATH="/ye/zaitlenlabstore/christacaggiano/bowtie2-2.3.3"

################ experiment directories ######################
input="/ye/zaitlenlabstore/christacaggiano/cfDNA_project/second_primary_methylation_data_ALS/fastq_files/zaitlenn-"$SGE_TASK_ID"_BC"
output="/zaitlen/netapp/group/christa"
fastq_prefix=$SGE_TASK_ID
scripts="scripts"

################# runs the pipeline ##########################
python wgbs.py $input $output $fastq_prefix $scripts

Run Command

qsub qsub_submit.sh

Output

Sample output directory

11/
# fastq file split into ~7 million lines for easier processing
├── 11.R100
├── 11.R101
├── 11.R102
...
├── 11.R222
├── 11.R223
├── 11.R224
├── 11.R225
├── bam_files
# trimmed bam files with report on trimming
│   ├── 11.R100_val_1_bismark_bt2_pe.bam
│   ├── 11.R100_val_1_bismark_bt2_PE_report.txt
│   ...
│   ├── 11.R125_val_1_bismark_bt2_pe.bam
│   ├── 11.R125_val_1_bismark_bt2_PE_report.txt
# path to put any logs that occur during sequencing
├── log_path
# temporary files. useful for troubleshooting
├── temp_dir
│   ├── 11.R100_trimming_report.txt
│   ├── 11.R100_val_1.fq
│   ├── 11.R102_trimming_report.txt
... ...
│   └── 11.R225_val_2.fq
# bismark methylation files
└── unsortedButMerged_ForBismark_file
    ├── 11_unsorted_merged.bam # all trimmed bam files merged into one
    ├── 11_unsorted_merged.deduplicated.bam  # deduplicated bam file
    ├── 11_unsorted_merged.deduplication_report.txt  # report on deduplication
# methylation extraction performed by bismark
    └── methylation_extraction
        ├── 11_unsorted_merged.deduplicated.bedGraph.gz # percent methylated for each CpG
        ├── 11_unsorted_merged.deduplicated.bismark.cov.gz # count methylated and unmethylated for each CpG as well as % methylated
        ├── 11_unsorted_merged.deduplicated.M-bias.txt  # graph on methylation bias
        ├── 11_unsorted_merged.deduplicated_splitting_report.txt  # report on methylation calls
        ├── CHG_context_11_unsorted_merged.deduplicated.txt  # every CHG methylation call
        ├── CHH_context_11_unsorted_merged.deduplicated.txt # every CHH methylation call
        └── CpG_context_11_unsorted_merged.deduplicated.txt # every CpG methylation call

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
log_files		log_files
percent_methylation		percent_methylation
scripts		scripts
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
runFastQC		runFastQC
runWGBS		runWGBS
wgbs.py		wgbs.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

log_files

log_files

percent_methylation

percent_methylation

scripts

scripts

.DS_Store

.DS_Store

.gitignore

.gitignore

README.md

README.md

runFastQC

runFastQC

runWGBS

runWGBS

wgbs.py

wgbs.py

Repository files navigation

ENCODE WGBS WORKFLOW

Overview

Software Requirements

To Run

QSUB parameters

Software Environment

Location of input/output directories

Sample qsub submit script

Run Command

Output

Sample output directory

About

Releases

Packages

Languages

christacaggiano/ENCODE_WGBS

Folders and files

Latest commit

History

Repository files navigation

ENCODE WGBS WORKFLOW

Overview

Software Requirements

To Run

QSUB parameters

Software Environment

Location of input/output directories

Sample qsub submit script

Run Command

Output

Sample output directory

About

Topics

Resources

Stars

Watchers

Forks

Languages