Usage

If you use or are inspired by code from this repo, please cite related manuscripts and data, including but not limited to:

• This repo DOI

• Lind et al. (in press) Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species. Accepted to Molecular Ecology Resources. Available on bioRxiv https://doi.org/10.1101/2020.10.07.329961


GATK individual and poolseq pipeline


Assumed environment

  1. Access to an HPC with a scheduler (e.g., slurm, SGE, PBS, TORQUE) - this pipeline assumes slurm
  2. Ability to load the following modules via:
    1. module load bwa/0.7.17
    2. module load samtools/1.9
    3. module load picard/2.18.9
    4. module load gatk/4.1.0.0
    5. module load bcftools/1.9
  3. In the parentdir folder that contains the fastq files, copy the following into a file called bash_variables. The def-someuser placeholder reflects the Compute Canada account you would like to use to submit jobs. If you have multiple accounts available, the pipeline will balance load among them (you choose these accounts during 00_start execution); the account below is only needed to submit jobs before the pipeline begins balancing load. See the example file in the GitHub repository.
    export SLURM_ACCOUNT=def-someuser
    export SBATCH_ACCOUNT=$SLURM_ACCOUNT
    export SALLOC_ACCOUNT=$SLURM_ACCOUNT
    export PYTHONPATH="${PYTHONPATH}:$HOME/gatk_pipeline"
    export SQUEUE_FORMAT="%.8i %.8u %.15a %.68j %.3t %16S %.10L %.5D %.4C %.6b %.7m %N (%r)"
    # placeholder for python environment activation (see below)
    
    1. The following is assumed regarding the names of the slurm accounts found by sshare -U --user $USER --format=Account:
      1. The total character length of the account name is less than 15 characters - the full slurm account name needs to appear in the ACCOUNT column of the output from squeue -u $USER (using the exported SQUEUE_FORMAT above); if not, increase the width in SQUEUE_FORMAT from %.15a to e.g. %.20a.
      2. The characters in the account name that come before an underscore are sufficient to distinguish unique accounts - if the account name does not have an underscore, this is fine.
      3. The accounts that the user would like to use end in _cpu (as opposed to e.g. _gpu). The pipeline will skip over non-_cpu accounts. A quick way to check the available account names is sketched below.
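      For example (the account names here are hypothetical and the exact sshare output layout varies by cluster):

        sshare -U --user $USER --format=Account
        # Account
        # --------------------
        # def-someuser_cpu    <- used by the pipeline (ends in _cpu)
        # def-someuser_gpu    <- skipped by the pipeline (does not end in _cpu)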
  4. Ability to install a virtual environment with python 3.7 (e.g., virtualenv --no-download ~/py3)
    1. Add the appropriate activation command to the bash_variables file (e.g., source ~/anaconda3/bin/activate py3 for conda, or source ~/py3/bin/activate for virtualenv), as sketched below.
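    For example, a minimal sketch using virtualenv (the python module name is an assumption and depends on your cluster):

      # load a python 3.7 module and create the environment (module name may differ on your cluster)
      module load python/3.7
      virtualenv --no-download ~/py3
      # record the activation command in bash_variables so it runs whenever that file is sourced
      echo "source ~/py3/bin/activate" >> /path/to/parentdir/bash_variables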
  5. The reference fasta file should be uncompressed (e.g., ref.fa not ref.fa.gz), and the following commands should be executed before starting the pipeline:
    1. bwa index ref.fa
    2. samtools faidx ref.fa
  6. This pipeline assumes that the user will want to parallelize the HaplotypeCaller and GenotypeGVCFs stages by using interval files (.list files). For example, chromosomes are often parallelized at this stage; ultimately the degree of parallelization (i.e., the number of .list files) is up to the user. The following is assumed about these files:
    1. They are located within a directory called 'intervals' in the same folder as the ref.fa

    2. The interval filenames are of the form 'batch_uniqueID.list'.

      1. uniqueID can be any string as long as there are no hyphens or underscores (e.g., batch_0013.list, batch_chrXIII.list)
    3. The contents of the .list files should have at least one entry (e.g., a chromosome name found in the ref.fa file) or a list of chromosomes/scaffolds/intervals with or without start-stop positions. There should not be a blank line at the end of the .list file; this will cause an error. A helper sketch for generating such files follows the examples below.

      batch_chrXI.list

      chrXI
      

      batch_0001.list

      Scaffold_0001
      

      batch_0013.list

      Scaffold_1693:1-122845
      Scaffold_1693:123446-409462
      Scaffold_1693:410063-584146
      ...
      
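    As a starting point, a minimal sketch (not part of the pipeline) that writes one numbered batch_NNNN.list file per scaffold listed in the samtools faidx index:

      # run from the directory containing ref.fa and ref.fa.fai;
      # writes intervals/batch_0000.list, intervals/batch_0001.list, ... with one scaffold per file
      mkdir -p intervals
      i=0
      while read -r scaffold _; do
          printf '%s\n' "$scaffold" > "intervals/batch_$(printf '%04d' "$i").list"
          i=$((i + 1))
      done < ref.fa.fai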
  7. Clone the pipeline repo's master branch to the server and create a symlink in $HOME so that it can be accessed via $HOME/gatk_pipeline

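    For example (assuming the repository lives at github.com/CoAdapTree/gatk_pipeline and is cloned outside $HOME; the clone location is up to you as long as the $HOME/gatk_pipeline symlink resolves):

      mkdir -p ~/repos
      git clone -b master https://github.com/CoAdapTree/gatk_pipeline.git ~/repos/gatk_pipeline
      ln -s ~/repos/gatk_pipeline $HOME/gatk_pipeline
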
Using the pipeline

  • First create a python3 environment then activate it (see above).

    • then use the requirements.txt file in the repo to install the appropriate python dependencies.
      • pip install -r $HOME/gatk_pipeline/requirements.txt
  • See the example datatable.txt file needed for 00_start-gatk_pipeline.py.

    • file names in the file_name_r1 and file_name_r2 columns should be basenames, not full paths
    • the entries in the ref column should be full paths
    • sample names in the 'sample_name' column can be duplicated for samples sequenced multiple times
    • read group (RG) info, file paths, etc. should of course differ between the sequenced files of a single sample; an illustrative sketch follows
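    For illustration only (a sketch of just the columns named above; the authoritative column set, including the read group columns, is the example datatable.txt in the repo), a tab-delimited layout might look like:

      sample_name   file_name_r1                file_name_r2                ref
      samp_01       samp_01_run1_R1.fastq.gz    samp_01_run1_R2.fastq.gz    /full/path/to/ref.fa
      samp_01       samp_01_run2_R1.fastq.gz    samp_01_run2_R2.fastq.gz    /full/path/to/ref.fa
      samp_02       samp_02_run1_R1.fastq.gz    samp_02_run1_R2.fastq.gz    /full/path/to/ref.fa

    Here samp_01 appears twice because it was sequenced twice; each row keeps its own fastq basenames while the ref column carries the full path.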
  • Once the environment is set up, put datatable.txt and the fastq files (symlinks work too) into a folder. This is the folder I call PARENTDIR.

  • To kick off the pipeline, source your bash_variables file from within parentdir (source bash_variables); this activates the python env and exports the PYTHONPATH to the pipeline along with the other slurm variables. Then run 00_start-gatk_pipeline.py from the login node, and it will run the rest of the preprocessing pipeline automatically by serially sbatching jobs (through 05_combine_and_genotype_supervised.py).

(py3) [user@host ~]$ python $HOME/gatk_pipeline/00_start-gatk_pipeline.py -p PARENTDIR [-e EMAIL [-n EMAIL_OPTIONS]] [-maf MAF] [-h]

required arguments:
  -p PARENTDIR          /path/to/directory/with/fastq.gz-files/

optional arguments:
  -e EMAIL              the email address you would like to have notifications
                        sent to (default: None)
  -n EMAIL_OPTIONS [EMAIL_OPTIONS ...]
                        the type(s) of email notifications you would like to
                        receive from the pipeline. Requires --email-address.
                        These options are used to fill out the #SBATCH flags.
                        must be one (or multiple) of ['all', 'fail', 'begin',
                        'end', 'pipeline-finish'] (default: None)
  -maf MAF              At the end of the pipeline, VCF files will be filtered
                        for MAF. If the pipeline is run on a single
                        population/pool, the user can set MAF to 0.0 so as to
                        filter variants based on global allele frequency
                        across populations/pools at a later time. (default: 0.05)
  -h, --help            Show this help message and exit.
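
For example, a hypothetical kickoff might look like the following (the parentdir path and email address are placeholders):

[user@host ~]$ cd /path/to/PARENTDIR
[user@host PARENTDIR]$ source bash_variables
(py3) [user@host PARENTDIR]$ python $HOME/gatk_pipeline/00_start-gatk_pipeline.py -p $(pwd) -e user@example.com -n fail pipeline-finish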
