Usage

If you use or are inspired by code from this repo, please cite related manuscripts and data, including but not limited to:

• This repo DOI

• Lind et al. (in press) Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species. Accepted to Molecular Ecology Resources. Available on bioRxiv https://doi.org/10.1101/2020.10.07.329961


GATK individual and poolseq pipeline


Assumed environment

  1. Access to an HPC with a scheduler (e.g., slurm, SGE, PBS, TORQUE) - this pipeline assumes slurm
  2. Ability to load the following modules via:
    1. module load bwa/0.7.17
    2. module load samtools/1.9
    3. module load picard/2.18.9
    4. module load gatk/4.1.0.0
    5. module load bcftools/1.9
  3. In the parentdir folder that contains the fastq files, copy the following into a file called bash_variables. The def-someuser placeholder reflects the Compute Canada account you would like to use to submit jobs. If you have multiple accounts available, the pipeline will balance load among them (you choose these accounts during 00_start execution); the account below is only needed to submit jobs before the pipeline begins balancing load. See the example file in the GitHub repository.
    export SLURM_ACCOUNT=def-someuser
    export SBATCH_ACCOUNT=$SLURM_ACCOUNT
    export SALLOC_ACCOUNT=$SLURM_ACCOUNT
    export PYTHONPATH="${PYTHONPATH}:$HOME/gatk_pipeline"
    export SQUEUE_FORMAT="%.8i %.8u %.15a %.68j %.3t %16S %.10L %.5D %.4C %.6b %.7m %N (%r)"
    # placeholder for python environment activation (see below)
    
    1. The following is assumed regarding the names of the slurm accounts found by sshare -U --user $USER --format=Account:
      1. The total character length of the account name is less than 15 characters - the full slurm account name needs to appear in the ACCOUNT column of the output from squeue -u $USER (using the exported SQUEUE_FORMAT above); if not, increase the width in SQUEUE_FORMAT from %.15a to e.g. %.20a.
      2. The characters in the account name that come before an underscore are sufficient to distinguish unique accounts - if the account name does not have an underscore, this is fine.
      3. The accounts that the user would like to use end in _cpu (as opposed to e.g. _gpu). The pipeline will skip over non-_cpu accounts. A quick way to check the available account names is sketched below.
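      For example (the account names here are hypothetical and the exact sshare output layout varies by cluster):

        sshare -U --user $USER --format=Account
        # Account
        # --------------------
        # def-someuser_cpu    <- used by the pipeline (ends in _cpu)
        # def-someuser_gpu    <- skipped by the pipeline (does not end in _cpu)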
  4. Ability to install a virtual environment with python 3.7 (e.g., virtualenv --no-download ~/py3)
    1. Add the appropriate activation command to the bash_variables file (e.g., source ~/anaconda3/bin/activate py3 for conda, or source ~/py3/bin/activate for virtualenv), as sketched below.
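    For example, a minimal sketch using virtualenv (the python module name is an assumption and depends on your cluster):

      # load a python 3.7 module and create the environment (module name may differ on your cluster)
      module load python/3.7
      virtualenv --no-download ~/py3
      # record the activation command in bash_variables so it runs whenever that file is sourced
      echo "source ~/py3/bin/activate" >> /path/to/parentdir/bash_variables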
  5. The reference fasta file should be uncompressed (e.g., ref.fa not ref.fa.gz), and the following commands should be executed before starting the pipeline:
    1. bwa index ref.fa
    2. samtools faidx ref.fa
  6. This pipeline assumes that the user will want to parallelize the HaplotypeCaller and GenotypeGVCFs stages by using interval files (.list files). For example, chromosomes are often parallelized at this stage; ultimately the degree of parallelization (i.e., the number of .list files) is up to the user. The following is assumed about these files:
    1. They are located within a directory called 'intervals' in the same folder as the ref.fa

    2. The interval filenames are of the form 'batch_uniqueID.list'.

      1. uniqueID can be any string as long as there are no hyphens or underscores (e.g., batch_0013.list, batch_chrXIII.list)
    3. The contents of the .list files should have at least one entry (e.g., a chromosome name found in the ref.fa file) or a list of chromosomes/scaffolds/intervals with or without start-stop positions. There should not be a blank line at the end of the .list file; this will cause an error. A helper sketch for generating such files follows the examples below.

      batch_chrXI.list

      chrXI
      

      batch_0001.list

      Scaffold_0001
      

      batch_0013.list

      Scaffold_1693:1-122845
      Scaffold_1693:123446-409462
      Scaffold_1693:410063-584146
      ...
      
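    As a starting point, a minimal sketch (not part of the pipeline) that writes one numbered batch_NNNN.list file per scaffold listed in the samtools faidx index:

      # run from the directory containing ref.fa and ref.fa.fai;
      # writes intervals/batch_0000.list, intervals/batch_0001.list, ... with one scaffold per file
      mkdir -p intervals
      i=0
      while read -r scaffold _; do
          printf '%s\n' "$scaffold" > "intervals/batch_$(printf '%04d' "$i").list"
          i=$((i + 1))
      done < ref.fa.fai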
  7. Clone the pipeline repo's master branch to the server and create a symlink in $HOME so that it can be accessed via $HOME/gatk_pipeline

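    For example (assuming the repository lives at github.com/CoAdapTree/gatk_pipeline and is cloned outside $HOME; the clone location is up to you as long as the $HOME/gatk_pipeline symlink resolves):

      mkdir -p ~/repos
      git clone -b master https://github.com/CoAdapTree/gatk_pipeline.git ~/repos/gatk_pipeline
      ln -s ~/repos/gatk_pipeline $HOME/gatk_pipeline
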
Using the pipeline

  • First create a python3 environment then activate it (see above).

    • then use the requirements.txt file in the repo to install the appropriate python dependencies.
      • pip install -r $HOME/gatk_pipeline/requirements.txt
  • See the example datatable.txt file needed for 00_start-gatk_pipeline.py.

    • file names in the file_name_r1 and file_name_r2 columns should be basenames, not full paths
    • the entries in the ref column should be full paths
    • sample names in the 'sample_name' column can be duplicated for samples sequenced multiple times
    • read group (RG) info, file paths, etc. should of course differ between the sequenced files of a single sample; an illustrative sketch follows
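    For illustration only (a sketch of just the columns named above; the authoritative column set, including the read group columns, is the example datatable.txt in the repo), a tab-delimited layout might look like:

      sample_name   file_name_r1                file_name_r2                ref
      samp_01       samp_01_run1_R1.fastq.gz    samp_01_run1_R2.fastq.gz    /full/path/to/ref.fa
      samp_01       samp_01_run2_R1.fastq.gz    samp_01_run2_R2.fastq.gz    /full/path/to/ref.fa
      samp_02       samp_02_run1_R1.fastq.gz    samp_02_run1_R2.fastq.gz    /full/path/to/ref.fa

    Here samp_01 appears twice because it was sequenced twice; each row keeps its own fastq basenames while the ref column carries the full path.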
  • Once the environment is set up, put datatable.txt and the fastq files (symlinks work too) into a folder. This is the folder I call PARENTDIR.

  • To kick off the pipeline, source your bash_variables file from within parentdir (source bash_variables); this activates the python env and exports the PYTHONPATH to the pipeline along with the other slurm variables. Then run 00_start-gatk_pipeline.py from the login node, and it will run the rest of the preprocessing pipeline automatically by serially sbatching jobs (through 05_combine_and_genotype_supervised.py).

(py3) [user@host ~]$ python $HOME/gatk_pipeline/00_start-gatk_pipeline.py -p PARENTDIR [-e EMAIL [-n EMAIL_OPTIONS]] [-maf MAF] [-h]

required arguments:
  -p PARENTDIR          /path/to/directory/with/fastq.gz-files/

optional arguments:
  -e EMAIL              the email address you would like to have notifications
                        sent to (default: None)
  -n EMAIL_OPTIONS [EMAIL_OPTIONS ...]
                        the type(s) of email notifications you would like to
                        receive from the pipeline. Requires --email-address.
                        These options are used to fill out the #SBATCH flags.
                        must be one (or multiple) of ['all', 'fail', 'begin',
                        'end', 'pipeline-finish'] (default: None)
  -maf MAF              At the end of the pipeline, VCF files will be filtered
                        for MAF. If the pipeline is run on a single
                        population/pool, the user can set MAF to 0.0 so as to
                        filter variants based on global allele frequency
                        across populations/pools at a later time. (default: 0.05)
  -h, --help            Show this help message and exit.
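
For example, a hypothetical kickoff might look like the following (the parentdir path and email address are placeholders):

[user@host ~]$ cd /path/to/PARENTDIR
[user@host PARENTDIR]$ source bash_variables
(py3) [user@host PARENTDIR]$ python $HOME/gatk_pipeline/00_start-gatk_pipeline.py -p $(pwd) -e user@example.com -n fail pipeline-finish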
