Whole-Exome sequencing data analysis

Note

This code/pipeline was written by me in 2018-2019 while working at the RayaGen startup. Its main use is automating whole-exome sequencing data analysis from raw reads (FASTQ) to a variant call format (VCF) file. The resulting VCF can then be passed to a variant annotation tool of your choice. The pipeline uses the Snakemake workflow management system to run the analysis.

The Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language and can be seamlessly scaled to server, cluster, grid and cloud environments without modifying the workflow definition. Snakemake workflows can also include a description of the required software, which is automatically deployed to any execution environment. For full documentation and installation instructions, please refer to the original Snakemake website.
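As a rough sketch, assuming Snakemake is installed (for example inside the wes conda environment described below) and the Snakefile sits in your working directory, the workflow could be launched like this (the core count is only an example):

snakemake -n -p              # dry run: list the jobs that would be executed, without running anything
snakemake --cores 32 -p      # run the full workflow on up to 32 cores, printing each shell command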

Several paths and resources are defined at the top of the Snakefile:

SAMPLES, = glob_wildcards("raw/{sample}_R1.fastq.gz") # raw FASTQ files should live under "<wes working dir>/raw" and be named {sample}_R1.fastq.gz / {sample}_R2.fastq.gz
READS=["R1", "R2"]
BASE_DIR = "/path/to/wes"
ADAPTORS = BASE_DIR + "/software/trimmomatic-0.38/adapters/TruSeq3-PE.fa"
REF_GENOME = "/path/to/wes/ref/human_g1k_v37_decoy.fasta"
BWA_INDEX = BASE_DIR + "/ref"
KNOWEN_SITE_SNP = BASE_DIR + "/ref/dbsnp_138.b37.vcf"
KNOWEN_SITE_INDEL = BASE_DIR + "/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
GATK3 = "/path/to/wes/software/GenomeAnalysisTK.jar"
PICARD = "/path/to/wes/software/picard.jar"
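For illustration, with a single hypothetical sample named Sample1, the raw directory inside the wes working directory would contain:

raw/Sample1_R1.fastq.gz
raw/Sample1_R2.fastq.gz

and glob_wildcards would then set SAMPLES to ["Sample1"].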



rule all:
  input:  expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
          #expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
          #expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
          #expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
          #expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
          #expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
          #expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
          #"BQSR/recal_data.table",
          #expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
          #expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
         # expand("alignment/{sample}.bam",sample=SAMPLES),
          #expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
          expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES),
          #expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS),
          expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
          expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS),
          #expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample=SAMPLES),
          #expand("seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz", sample=SAMPLES),
          #expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample=SAMPLES),
          #expand("seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz", sample=SAMPLES),
          expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
          expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS),

rule fastqc_raw:
    input:  expand("raw/{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
            expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
    priority:1
    message: "Executing Quality check (FASTQC)  on the following files {input}."
    shell: """
      mkdir -p logs/fastqc
      mkdir -p qc_raw
      fastqc --outdir qc_raw --threads 50 --nogroup {input} 2>>{log}
        """
rule trimming:
     input:
        fwd = "raw/{sample}_R1.fastq.gz",
        rev = "raw/{sample}_R2.fastq.gz",
     output:
        fwd_paired = "seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz",
        fwd_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz",
        rev_paired = "seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz",
        rev_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz"
     params:
       TRIMMOMATIC_JAR = BASE_DIR + "/software/trimmomatic-0.38/trimmomatic-0.38.jar",
       PAIR = "PE",
       threads=50,
       TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:3 TRAILING:3  SLIDINGWINDOW:4:20 MINLEN:36"
     log:  "logs/trimmomatic/{sample}.paired.trimomatic.log"
     priority:2
     message: "Executing Trimming (trimmomatic) on the following files {input.fwd} and {input.rev}."
     shell: """
        mkdir -p logs/trimmomatic
        java -jar {params.TRIMMOMATIC_JAR} \
           {params.PAIR} -threads {params.threads} {input.fwd} {input.rev} \
           {output.fwd_paired} {output.fwd_unpaired} \
           {output.rev_paired} {output.rev_unpaired} \
           {params.TRIM_OPTS}  2>{log}
        """
rule fastqc_trimmed:
    input:  expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
            expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
    priority:3
    message: "Executing Quality Checks (FASTQC)  on the following files {input}."
    shell: """
       mkdir -p qc_trimmed logs/fastqc
       fastqc --outdir qc_trimmed --threads 80 --nogroup {input} \
             >> qc_trimmed/fastqc.log 2>>{log}
          """
rule multiqc:
    input:  raw_fwd=    expand("qc_raw/{sample}_R1_fastqc.zip", sample= SAMPLES ),
            raw_rev=    expand("qc_raw/{sample}_R2_fastqc.zip", sample= SAMPLES ),
            trimed_fwd= expand("qc_trimmed/{sample}_trimmed_paired_R1_fastqc.zip", sample= SAMPLES ),
            trimed_rev= expand("qc_trimmed/{sample}_trimmed_paired_R2_fastqc.zip", sample= SAMPLES ),
    output: expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES)
    log:    expand("logs/multi_qc/{sample}.log", sample=SAMPLES)
    priority:4
    shell:"""
        mkdir -p multi_qc logs/multi_qc
        multiqc {input.raw_fwd} {input.raw_rev} {input.trimed_fwd} {input.trimed_rev} 2>{log}
        mv multiqc_report.html {output}
        mv multiqc_data multi_qc
        """
rule seqtk:
    input:  fwd= expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample= SAMPLES ),
            rev= expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample= SAMPLES ),
    output: expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
    log:    expand("logs/seqtk/{sample}.log", sample=SAMPLES),
    priority:5
    shell:"""
        mkdir -p interleaved_fastq logs/seqtk
        seqtk mergepe {input.fwd} {input.rev} 2>{log} | gzip > {output}
        """
rule bwa_map:
    input:
        ref=REF_GENOME,
        read=expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
    output:    bam = expand("alignment/{sample}.bam",sample=SAMPLES),
               #sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
               #bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
    log:expand("logs/alignment/{sample}.log", sample=SAMPLES),
    params:
        # NOTE: read-group fields (ID/SM/LB) are hard-coded here; adjust them per sample/run
        rg = r"@RG\tID:C6C0TANXX_2\tSM:ZW177\tLB:ZW177lib\tPL:ILLUMINA"
    priority:6
    shell:
       r"""
           mkdir -p alignment
           mkdir -p logs/alignment
           bwa mem -M -t 50 -p -R '{params.rg}' {input.ref} {input.read} \
            | samtools view -Sb - > {output.bam} 2>{log}
            """
rule bam_sort_index:
            input:expand("alignment/{sample}.bam",sample=SAMPLES),
            output:sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
                   bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
            shell:
               r"""
               samtools sort -o {output.sorted} {input}
               samtools index {output.sorted}
               """
rule mark_dup:
    input: expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
    output:    bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
               metrics ="alignment/marked_dup_metrics.txt",
               bai =expand("alignment/marked_duplicates_{sample}.bam.bai",sample=SAMPLES),
    log:expand("logs/mark_dup/{sample}.log", sample=SAMPLES),
    params:
       PICARD_JAR = PICARD,
    threads: 8
    priority:7
    shell:
       r"""
            mkdir -p logs/mark_dup
            java -Xmx64G -jar {params.PICARD_JAR} MarkDuplicates I={input} O={output.bam} M={output.metrics} TMP_DIR=/tmp 2>{log}
            samtools index {output.bam}
            """
rule BQSR_1:
    input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
           ref= REF_GENOME,
           known_site_snp= KNOWEN_SITE_SNP,
           known_site_indel = KNOWEN_SITE_INDEL,
    output:"BQSR/recal_data.table",
    log:expand("logs/BQSR/{sample}.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    priority:8
    shell:
       r"""
            mkdir -p logs/BQSR
            java -Xmx64G -jar {params.GATK3_JAR} -nct 32 -T BaseRecalibrator \
            -R {input.ref} -I {input.bam} \
            -knownSites {input.known_site_snp} \
            -knownSites {input.known_site_indel} \
            -o {output} 2>{log}
            """
rule BQSR_2:
    input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
           ref= REF_GENOME,
           recal_table = "BQSR/recal_data.table",
    output:expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
    log:expand("logs/BQSR/{sample}_2.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    priority:9
    shell:
       r"""
            java -Xmx64G -jar {params.GATK3_JAR} -nct 64 -T PrintReads \
            -R {input.ref} \
            -I {input.bam} \
            -BQSR {input.recal_table} \
            -o {output} 2>{log}
            """
rule Variant_calling:
    input: bam = expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
           ref= REF_GENOME,
           known_site_snp= KNOWEN_SITE_SNP,
           recal_table = "BQSR/recal_data.table",
           known_site_indel = KNOWEN_SITE_INDEL
    output:expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_raw_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    priority:10
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -nct 64 -T HaplotypeCaller \
            -R {input.ref} \
            -I {input.bam} \
            -stand_call_conf 30 \
            --genotyping_mode DISCOVERY \
            -o {output} 2>{log}
            """
rule Select_SNP:
    input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
           ref= REF_GENOME,
    output:expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_raw_SNP_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    threads: 8
    priority:11
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -T SelectVariants \
            -R {input.ref} \
            -V {input.vcf} \
            -selectType SNP \
            -o {output} 2>{log}
            """
rule Select_INDEL:
    input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
           ref= REF_GENOME,
    output:expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_raw_INDEL_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    threads: 8
    priority:12
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -T SelectVariants \
            -R {input.ref} \
            -V {input.vcf} \
            -selectType INDEL \
            -o {output} 2>{log}
            """
rule SNP_FILTER:
    input: vcf = expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
           ref= REF_GENOME,
    output:expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_filtered_SNP_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    threads: 8
    priority:13
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -T VariantFiltration \
            -R {input.ref} \
            -V {input.vcf} \
            --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
            --filterName "Low_Qual" \
            -o {output} 2>{log}
            """
rule INDEL_FILTER:
    input: vcf = expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
           ref= REF_GENOME,
    output:expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_filtered_INDEL_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    threads: 8
    priority:13
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -T VariantFiltration \
            -R {input.ref} \
            -V {input.vcf} \
            --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
            --filterName "Low_Qual" \
            -o {output} 2>{log}
            """
rule combine_variants:
    input:snp =   expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
          indel = expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
           ref= REF_GENOME,
    output:expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
    log:expand("logs/VCF/{sample}_filtered_final_vcf.log", sample=SAMPLES),
    params:
       GATK3_JAR = GATK3,
    priority:13
    shell:
       r"""
            java -Xmx80G -jar {params.GATK3_JAR} -T CombineVariants \
            -R {input.ref} \
            -V {input.snp} \
            -V {input.indel} \
            --genotypemergeoption UNSORTED \
            -o {output} 2>{log}
            """

There is an environment YAML file in this directory that can help you recreate the same Python environment I used, named wes. With conda you can build an identical wes environment from it.
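A minimal sketch of how that might look (environment.yml is a placeholder here; use the actual YAML file name shipped in the repository):

conda env create -f environment.yml -n wes   # build the environment described in the YAML file
conda activate wes                           # activate it before invoking snakemake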

Pipeline steps

This pipeline consists of the following steps:

1. Quality check of the raw reads with FastQC (rule fastqc_raw)
2. Adapter and quality trimming with Trimmomatic (rule trimming)
3. Quality check of the trimmed reads with FastQC (rule fastqc_trimmed)
4. Aggregated QC report with MultiQC (rule multiqc)
5. Interleaving of the paired reads with seqtk mergepe (rule seqtk)
6. Alignment to the reference genome with BWA-MEM and conversion to BAM with samtools (rule bwa_map)
7. Sorting and indexing of the BAM files with samtools (rule bam_sort_index)
8. Duplicate marking with Picard MarkDuplicates (rule mark_dup)
9. Base quality score recalibration with GATK3 BaseRecalibrator and PrintReads (rules BQSR_1 and BQSR_2)
10. Variant calling with GATK3 HaplotypeCaller (rule Variant_calling)
11. Separation of SNPs and INDELs with SelectVariants (rules Select_SNP and Select_INDEL)
12. Hard filtering of SNPs and INDELs with VariantFiltration (rules SNP_FILTER and INDEL_FILTER)
13. Merging of the filtered SNP and INDEL calls into the final VCF with CombineVariants (rule combine_variants)

Final note: the repository contains a Snakefile. Its content is also copied into Snakefile.sh for inspection purposes; however, to run the pipeline, the Snakefile itself should be downloaded and placed in your wes directory.
