Pipeline for CoAdapTree to autonomously process raw fastq files through SNP calling with VarScan and subsequent filtering following bioinformatic best practices.


Usage

If you use or are inspired by code from this repo, please cite related manuscripts and data including, but not limited to:

  • This repo:

     Lind B (2021) GitHub.com/CoAdapTree/varscan_pipeline: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083302

  • Lind et al. (in press) Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species. Accepted to Molecular Ecology Resources. Available on bioRxiv: https://doi.org/10.1101/2020.10.07.329961

  • The related test-data validation repo:

     Lind BM (2021) GitHub.com/CoAdapTree/testdata_validation: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083292

VarScan poolseq pipeline

Call SNPs and INDELs across pooled populations using VarScan. Filter on mapping quality, proper pairs, MAF, GQ, and missing data, and redirect SNPs from repeat regions or potentially paralogous sites into distinct files. Once started, the pipeline carries on through SNP filtering, automatically sbatching jobs when appropriate. If the option is chosen at startup, the user will receive an email when the pipeline is finished. Various customization options are available; see help and usage below.


Pipeline workflow

Command Specifics

  • see the docstrings at the top of each .py file in the repo for purpose, usage, assumptions, TODOs, etc.
  • scroll each .py file to see the bash text/replacements
  • see the example folder for example command.sh files output by the pipeline; a hedged sketch of the core commands appears after this list.
  • 01_trim-fastq.py
    • Trim fastq files with fastp, iterating across windows of size 5 and requiring a mean quality of 30; discard reads that do not retain at least 75 bp after trimming or that contain more than 20 Ns. Attempt to remove adaptors if provided in datatable.txt.
  • 02_bwa-map_view-sort_index_flagstat.py
    • Add read groups, map with bwa mem, and filter out reads that have mapping quality < 20 or are not in proper pairs; discard reads whose query is unmapped.
  • 03_mark_build.py
    • mark and remove duplicates with Picard
  • start_varscan.py
    • call mpileup2cns with default flags except: --min-avg-qual 20 (minimum base quality set to 20) and --p-value 0.05 (maximum p-value set to 0.05).
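
As a rough illustration, here is a hedged sketch of the kinds of commands these scripts template. These are not the pipeline's exact command lines: file names, the read-group fields, and $EBROOTPICARD (Compute Canada's Picard module variable) are placeholders/assumptions, and the fastp/samtools flags approximate the settings described above.

    # 01: trim with fastp - 5-bp sliding window at mean quality 30; min length 75; max 20 Ns
    fastp -i samp_R1.fastq.gz -I samp_R2.fastq.gz -o trim_R1.fastq.gz -O trim_R2.fastq.gz \
      --cut_by_quality5 --cut_by_quality3 --cut_window_size 5 --cut_mean_quality 30 \
      --length_required 75 --n_base_limit 20

    # 02: add a read group and map with bwa mem; keep proper pairs (-f 2),
    #     drop unmapped reads (-F 4) and mapping quality < 20 (-q 20); sort and index
    bwa mem -R '@RG\tID:samp\tSM:samp\tPL:ILLUMINA' ref.fa trim_R1.fastq.gz trim_R2.fastq.gz |
      samtools view -b -q 20 -f 2 -F 4 - |
      samtools sort -o samp_sorted.bam -
    samtools index samp_sorted.bam

    # 03: mark and remove duplicates with Picard
    java -jar $EBROOTPICARD/picard.jar MarkDuplicates I=samp_sorted.bam \
      O=samp_dedup.bam M=samp_dupmetrics.txt REMOVE_DUPLICATES=true

    # start_varscan: pile up and call SNPs/INDELs with VarScan mpileup2cns
    samtools mpileup -f ref.fa samp_dedup.bam |
      java -jar $VARSCAN_DIR/VarScan.v2.4.3.jar mpileup2cns \
        --min-avg-qual 20 --p-value 0.05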

Workflow features

  • To reduce total wait times before a scheduled job begins to run, at each stage of the pipeline outlined above the pipeline evenly spreads Priority jobs in the queue across any number of slurm accounts available to the user (see the docstring of balance_queue.py).
  • This script can also be run manually from the command line and can be used outside of pipeline purposes; see the example below.
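
For example, a manual invocation might look like the following (a sketch only; see the docstring of balance_queue.py for any required arguments, which may differ):

    (py3) [user@host ~]$ python $HOME/pipeline/balance_queue.py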

Final file output by pipeline

  • The final file will be of the form f'{pool_name}-varscan_all_bedfiles_TYPE.txt', where TYPE is SNP plus, depending on the input flags given to 00_start-pipeline.py, PARALOGS and/or REPEATS; see the example below.
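
For instance, a hypothetical pool named DF_p1 run with both --rm_repeats and --rm_paralogs would end with files like the following (the pool name is illustrative; the pattern follows the f-string above):

    DF_p1-varscan_all_bedfiles_SNP.txt
    DF_p1-varscan_all_bedfiles_REPEATS.txt
    DF_p1-varscan_all_bedfiles_PARALOGS.txt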

Assumed environment

  1. Access to an HPC with a scheduler (e.g., slurm, SGE, PBS, TORQUE). This pipeline assumes slurm and was written with the Compute Canada servers in mind; it is not meant as a generally deployable 'program'.
  2. Ability to load the following modules via:
    1. module load bwa/0.7.17
    2. module load fastp/0.19.5
    3. module load java
    4. module load samtools/1.9
    5. module load picard/2.18.9
    6. module load gatk/3.8 and module load gatk/4.1.0.0
    7. module load bcftools/1.9
  3. Download and install VarScan (VarScan.v2.4.3.jar) from https://github.com/dkoboldt/varscan. The path to the jar executable will be exported in bash_variables below (see the setup sketch after this list).
  4. In the parentdir folder that contains the fastq files, copy the following into a file called bash_variables. The def-someuser reflects the Compute Canada account you would like to use to submit jobs. If you have multiple accounts available, the pipeline will balance load among them (you choose these accounts during 00_start execution); the following account is needed to submit jobs before the pipeline balances load. See the example file in the GitHub repository.
    export SLURM_ACCOUNT=def-someuser
    export SBATCH_ACCOUNT=$SLURM_ACCOUNT
    export SALLOC_ACCOUNT=$SLURM_ACCOUNT
    export PYTHONPATH="${PYTHONPATH}:$HOME/pipeline"
    export SQUEUE_FORMAT="%.8i %.8u %.15a %.68j %.3t %16S %.10L %.5D %.4C %.6b %.7m %N (%r)"
    export VARSCAN_DIR=/project/def-yeaman/CoAdepTree/apps
    # placeholder for python environment activation (see below)
    
    1. The following is assumed regarding the name of slurm accounts found by sshare -U --user $USER --format=Account:
      1. The total character length of the account name is less than 15 - the full slurm account name will need to appear in the ACCOUNT column output from squeue -u $USER (using the exported SQUEUE_FORMAT above); if not, increase the field width in SQUEUE_FORMAT from %.15a to e.g. %.20a.
      2. The characters in the account name that come before an underscore are sufficient to distinguish unique accounts; if the account name does not have an underscore, this is fine.
      3. The accounts the user would like to use end in _cpu (as opposed to e.g. _gpu); the pipeline will skip over non-_cpu accounts.
  5. Ability to install virtual environment with python 3.7 (e.g., virtualenv --no-download ~/py3)
    1. See example instructions here: https://docs.computecanada.ca/wiki/Python
    2. Add the appropriate activation command to the bash_variables file (e.g., source ~/py3/bin/activate for virtualenv)
  6. The reference fasta file should be uncompressed (e.g., ref.fa, not ref.fa.gz), and the following commands should be executed before starting the pipeline:
    1. bwa index ref.fa
    2. samtools faidx ref.fa
    3. java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict
  7. Clone the pipeline repo's master branch to the server and create a symlink in $HOME so that it can be accessed via $HOME/pipeline.
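
Taken together, a hedged sketch of the one-time setup. The VarScan download URL, project paths, repo URL, and $EBROOTPICARD are assumptions to adapt to your own server; the step numbers refer to the list above.

    # 3. download VarScan; point $VARSCAN_DIR in bash_variables at this folder (URL/paths are assumptions)
    mkdir -p /project/def-someuser/apps && cd /project/def-someuser/apps
    wget https://github.com/dkoboldt/varscan/raw/master/VarScan.v2.4.3.jar

    # 4.1. confirm slurm account names meet the assumptions above
    sshare -U --user $USER --format=Account    # e.g., def-someuser_cpu; non-_cpu accounts are skipped

    # 5. create the python 3.7 environment
    virtualenv --no-download ~/py3

    # 6. prepare the uncompressed reference
    bwa index ref.fa
    samtools faidx ref.fa
    java -jar $EBROOTPICARD/picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict

    # 7. clone the repo and symlink it into $HOME
    git clone https://github.com/CoAdapTree/varscan_pipeline.git /project/def-someuser/varscan_pipeline
    ln -s /project/def-someuser/varscan_pipeline $HOME/pipeline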

Using the pipeline

  • First create a python3 environment (see above), then use the requirements.txt file in the repo to install the appropriate python dependencies:

    • pip install -r $HOME/pipeline/requirements.txt
  • See example datatable.txt file needed for 00_start-pipeline.py.

    • file names in file_name_r1 and file_name_r2 should be basenames, not full paths
    • entries in the ref column should be full paths
    • sample names in the 'sample_name' column can be duplicated for samples sequenced multiple times
    • RG info, file paths, etc. should of course differ between sequenced files of a single sample
  • Once the environment is set up, put datatable.txt and the fastq files (symlinks work too) into a folder. This is the folder I call PARENTDIR.

  • To kick off the pipeline, source your bash_variables file in parentdir (source bash_variables) to activate the python env and export the PYTHONPATH to the pipeline along with the other slurm variables. Then run 00_start-pipeline.py from the login node; it will run the rest of the preprocessing pipeline automatically by serially sbatching jobs up through SNP calling and filtering. If the user has chosen pipeline-finish as an email option, the pipeline will email the user once the pipeline is complete. If fail is chosen as an email option, the user will be emailed if any jobs die.

  • Once the pipeline has finished you will get an email (assuming you chose the 'pipeline-finish' email notification) for each unique pool in the pool_name column of datatable.txt. Once all pools have finished, the next step is to run 98_get_read_stats.py (see docstring at top of file for usage) to get summaries of read counts from pre-pipeline through trimming, mapping, and realignment. Next, run 99_bundle_files_for_transfer.py (see docstring at top of file for usage) to bundle all necessary files for transfer to a local server. This script will create a .txt file with rsync commands that the user can execute from the remote server, and can also create md5 files for .bam and varscan output. Example invocations appear below.
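
For example (hedged; the exact arguments live in each script's docstring and may differ, and PARENTDIR below is a placeholder):

    (py3) [user@host ~]$ python $HOME/pipeline/98_get_read_stats.py PARENTDIR
    (py3) [user@host ~]$ python $HOME/pipeline/99_bundle_files_for_transfer.py PARENTDIR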

(py3) [user@host ~]$ python $HOME/pipeline/00_start-pipeline.py -p PARENTDIR [-e EMAIL] [-n EMAIL_OPTIONS [EMAIL_OPTIONS ...]] [-maf MAF] [--translate] [--rm_repeats] [--rm_paralogs] [-h]

required arguments:
  -p PARENTDIR          /path/to/directory/with/fastq.gz-files/

optional arguments:
  -e EMAIL              the email address you would like to have notifications
                        sent to
  -n EMAIL_OPTIONS [EMAIL_OPTIONS ...]
                        the type(s) of email notifications you would like to
                        receive from the pipeline. Requires --email-address.
                        These options are used to fill out the #SBATCH flags.
                        Must be one (or multiple) of
                        ['all', 'fail', 'begin', 'end', 'pipeline-finish']
                        (default: None)
  -maf MAF              At the end of the pipeline, VCF files will be filtered
                        for MAF. If the pipeline is run on a single
                        population/pool, the user can set MAF to 0.0 so as to
                        filter variants based on global allele frequency
                        across populations/pools at a later time. (If the
                        number of sample_names in a pool == 1, the default is
                        maf=0; otherwise maf = 1/sum(ploidy column).)
  --translate           Boolean: true if used, false otherwise. If a stitched
                        genome is used for mapping, this option will look for
                        a ref.order file in the same directory as the
                        ref.fasta - where ref is the basename of the ref.fasta
                        (without the .fasta). The pipeline will use this
                        .order file to translate mapped positions to
                        unstitched positions at the end of the pipeline while
                        filtering. Positions in .order file are assumed to be
                        1-based indexing. Assumes .order file has no header,
                        and is of the format (contig name from unstitched
                        genome, start/stop are positions in the stitched genome):
                        ref_scaffold<tab>contig_name<tab>start_pos<tab>stop_pos<tab>contig_length
                        (default: False)
  --rm_repeats          Boolean: true if used, false otherwise. If repeat
                        regions are available, remove SNPs that fall within
                        these regions from final SNP table and write to
                        a REPEATS table. This option will look for a .txt file
                        in the same directory as the ref.fasta. Assumes the
                        filename is of the form: ref_repeats.txt - where ref
                        is the basename of the ref.fasta (without the .fasta).
                        This file should have 1-based indexing and should be
                        located in the same directory as the reference. The
                        file should have a header ('CHROM', 'start', 'stop').
                        The CHROM column can be names in the reference (if
                        using unstitched reference), or names of contigs that
                        were stitched to form the reference. If using a
                        stitched genome, --translate is required. (default:
                        False)
  --rm_paralogs         Boolean: true if used, false otherwise. If candidate
                        sites have been isolated within the reference where
                        distinct gene copies (paralogs) map to the same
                        position (and thus create erroneous SNPs), remove any
                        SNPs that fall on these exact sites and write to a
                        PARALOGS file. The pipeline assumes this file is
                        located in the parentdir, and ends with
                        '_paralog_snps.txt'. This file is tab-delimited, and
                        must have a column called 'locus' that contains
                        hyphen-separated CHROM-POS sites for paralogs.
                        These sites should be found in the current ref.fa
                        being used to call SNPs (otherwise SNPs cannot be
                        filtered by these sites). (default: False)
  -h, --help            Show this help message and exit.
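
For example, a hypothetical invocation (the path, email, MAF value, and flag choices are illustrative; adjust them to your data):

    (py3) [user@host ~]$ python $HOME/pipeline/00_start-pipeline.py -p /scratch/someuser/mydata \
        -e someuser@example.com -n fail pipeline-finish -maf 0.05 --rm_repeats --rm_paralogs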
