If you use or are inspired by code from this repo, please cite related manuscripts and data including, but not limited to:

- Lind B (2021) GitHub.com/CoAdapTree/varscan_pipeline: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083302
- Lind et al. (in press) Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species. Accepted to Molecular Ecology Resources. Available on bioRxiv: https://doi.org/10.1101/2020.10.07.329961
- Lind BM (2021) GitHub.com/CoAdaptree/testdata_validation: Publication release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083292
Call SNPs and INDELs across pooled populations using VarScan. Filter SNPs (mapping quality, proper pairs, MAF, GQ, missing data) and redirect SNPs from repeat regions or potential paralogous sites into distinct files. Once started, the pipeline carries on through SNP filtering, automatically sbatching jobs when appropriate. If selected at startup, the user will receive an email when the pipeline is finished. Various customizations are available; see help and usage below.
Command Specifics
- see docstrings at top of each .py file in repo for purpose, usage, assumptions, TODOs, etc.
- scroll each .py file to see bash text/replacements
- see example folder for example command.sh files output from the pipeline.
- 01_trim-fastq.py
  - Trim fastq files with fastp by iterating across windows of size 5 and requiring a mean quality of 30; discard any reads that do not retain 75 bp or that have more than 20 N's. Attempt to remove adapters if provided in datatable.txt
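The trimming thresholds above can be sketched as a fastp invocation. This is an illustrative reconstruction, not the pipeline's exact command (flag names are from the fastp CLI; paths and the helper name are mine — see 01_trim-fastq.py for the real call):

```python
def build_fastp_cmd(r1, r2, adapter=None):
    """Sketch of a fastp command matching the thresholds described above."""
    cmd = [
        "fastp",
        "-i", r1, "-I", r2,
        "-o", r1.replace(".fastq.gz", "_trimmed.fastq.gz"),
        "-O", r2.replace(".fastq.gz", "_trimmed.fastq.gz"),
        "--cut_window_size", "5",     # sliding window of size 5
        "--cut_mean_quality", "30",   # require mean Q30 per window
        "--length_required", "75",    # discard reads shorter than 75 bp
        "--n_base_limit", "20",       # discard reads with more than 20 N's
    ]
    if adapter:  # adapter sequence from datatable.txt, if provided
        cmd += ["--adapter_sequence", adapter]
    return cmd

print(" ".join(build_fastp_cmd("samp_R1.fastq.gz", "samp_R2.fastq.gz")))
```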
- 02_bwa-map_view-sort_index_flagstat.py
- Add read groups, map with bwa mem, and filter out reads that have mapping quality < 20 or are not proper pairs; discard unmapped queries
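The mapping-quality and proper-pair filters above correspond to standard samtools view flags. A minimal sketch (the helper name and exact plumbing are assumptions; see 02_bwa-map_view-sort_index_flagstat.py for the real commands):

```python
def build_map_filter_cmd(ref, r1, r2, read_group, out_bam):
    """Sketch: map with bwa mem, then filter and sort with samtools."""
    bwa = f"bwa mem -R '{read_group}' {ref} {r1} {r2}"
    # -q 20: drop MAPQ < 20; -f 0x2: keep proper pairs; -F 0x4: drop unmapped queries
    view = "samtools view -b -q 20 -f 0x2 -F 0x4 -"
    sort = f"samtools sort -o {out_bam} -"
    return " | ".join([bwa, view, sort])

print(build_map_filter_cmd("ref.fa", "r1.fq.gz", "r2.fq.gz",
                           "@RG\\tID:samp1\\tSM:samp1", "samp1.bam"))
```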
- 03_mark_build.py
- Mark and remove duplicates with Picard tools
- start_varscan.py
  - call `mpileup2cns` with default flags except: `--min-avg-qual 20` (minimum base quality set to 20) and `--p-value 0.05` (maximum p-value set to 0.05)
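The VarScan call above might look like the following sketch; only `--min-avg-qual` and `--p-value` come from the text, while the jar path, output redirection, and `--output-vcf` flag are assumptions (see start_varscan.py for the real invocation):

```python
import os

def build_varscan_cmd(mpileup_file, out_vcf, varscan_dir):
    """Sketch of the mpileup2cns call with the non-default flags above."""
    jar = os.path.join(varscan_dir, "VarScan.v2.4.3.jar")
    return (f"java -jar {jar} mpileup2cns {mpileup_file} "
            "--min-avg-qual 20 --p-value 0.05 "  # flags described above
            f"--output-vcf 1 > {out_vcf}")

print(build_varscan_cmd("pool.mpileup", "pool.vcf", "/path/to/varscan"))
```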
Workflow features
- To reduce total wait times before a scheduled job begins to run, at each stage of the pipeline outlined above the pipeline will evenly spread jobs with Priority status across any slurm accounts available to the user (see docstring of balance_queue.py)
  - This script can also be run manually from the command line and used outside of the pipeline
Final file output by pipeline
- The final file will be of the form `{pool_name}-varscan_all_bedfiles_TYPE.txt`, where TYPE is one of SNP +/- PARALOGS, or REPEATS, depending on input flags to 00_start-pipeline.py.
- Access to an HPC with a scheduler (e.g., slurm, SGE, PBS, TORQUE) - this pipeline assumes slurm with the Compute Canada servers in mind (not meant as a deployable 'program')
- Ability to load the following modules:

```bash
module load bwa/0.7.17
module load fastp/0.19.5
module load java
module load samtools/1.9
module load picard/2.18.9
module load gatk/3.8
module load gatk/4.1.0.0
module load bcftools/1.9
```
- Download and install VarScan (`VarScan.v2.4.3.jar`). The path to the jar executable will be exported in `bash_variables` below. Download VarScan from here: https://github.com/dkoboldt/varscan
- In the `parentdir` folder that contains the fastq files, copy the following into a file called `bash_variables`. The `def-someuser` reflects the Compute Canada account that you would like to use to submit jobs. If you have multiple accounts available, the pipeline will balance load among them (you choose these accounts during 00_start execution). The following is needed to submit jobs before the pipeline balances load. See the example file in the GitHub repository.

```bash
export SLURM_ACCOUNT=def-someuser
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export PYTHONPATH="${PYTHONPATH}:$HOME/pipeline"
export SQUEUE_FORMAT="%.8i %.8u %.15a %.68j %.3t %16S %.10L %.5D %.4C %.6b %.7m %N (%r)"
export VARSCAN_DIR=/project/def-yeaman/CoAdepTree/apps
# placeholder for python environment activation (see below)
```
- The following is assumed regarding the names of slurm accounts found by `sshare -U --user $USER --format=Account`:
  - The total character length of the account name is less than 15; the full slurm account name will need to appear in the ACCOUNT column output from `squeue -u $USER` (using the exported `SQUEUE_FORMAT` above). If not, increase the digits in `SQUEUE_FORMAT` from `%.15a` to e.g. `%.20a`.
  - The characters in the account name that come before an underscore are sufficient to distinguish unique accounts; if the account name does not have an underscore, this is fine.
  - The accounts that the user would like to use end in `_cpu` (as opposed to e.g. `_gpu`). The pipeline will skip over non-`_cpu` accounts.
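The naming assumptions above can be summarized in a small sketch (the real selection logic lives in balance_queue.py; the function name here is mine and this is only an illustration):

```python
def usable_accounts(sshare_accounts):
    """Keep _cpu accounts; distinguish them by the text before the underscore."""
    keep = {}
    for acct in sshare_accounts:
        acct = acct.strip()
        if not acct.endswith("_cpu"):   # skip e.g. *_gpu accounts
            continue
        prefix = acct.split("_")[0]     # prefix assumed unique per account
        keep[prefix] = acct
    return sorted(keep.values())

print(usable_accounts(["def-someuser_cpu", "def-someuser_gpu", "rrg-lab_cpu"]))
```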
- Ability to install a virtual environment with python 3.7 (e.g., `virtualenv --no-download ~/py3`)
  - See example instructions here: https://docs.computecanada.ca/wiki/Python
  - Add the appropriate activation command to the `bash_variables` file (e.g. `source ~/py3/bin/activate` for virtualenv)
- The reference fasta file should be uncompressed (e.g. ref.fa, not ref.fa.gz), and the following commands should be executed before starting the pipeline:

```bash
bwa index ref.fa
samtools faidx ref.fa
picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict
```
- Clone the pipeline repo's master branch to the server and create a symlink in `$HOME` so that it can be accessed via `$HOME/pipeline`
- First create a python3 environment (see above), then use the `requirements.txt` file in the repo to install the appropriate python dependencies: `pip install -r $HOME/pipeline/requirements.txt`
- See the example datatable.txt file needed for `00_start-pipeline.py`.
  - File names in `file_name_r1` and `file_name_r2` should be basenames, not full paths
  - Entries in the `ref` column should be full paths
  - Sample names in the `sample_name` column can be duplicates for samples sequenced multiple times
    - RG info, file paths, etc. should of course differ between the sequenced files of a single sample
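The datatable.txt assumptions above could be checked with something like the following hypothetical helper (column names are from the text; the pipeline's own checks may differ):

```python
import csv
import io
import os

def check_datatable(text):
    """Flag rows violating the basename/full-path assumptions above."""
    problems = []
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    for i, row in enumerate(rows, start=2):  # line 1 is the header
        for col in ("file_name_r1", "file_name_r2"):
            if os.sep in row[col]:
                problems.append(f"line {i}: {col} should be a basename")
        if not row["ref"].startswith(os.sep):
            problems.append(f"line {i}: ref should be a full path")
    return problems  # duplicate sample_name values are allowed
```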
- Once the environment is set up, put `datatable.txt` and the fastq files (symlinks work too) into a folder. This is the folder I call `PARENTDIR`.
- To kick off the pipeline, source your bash_variables file in parentdir (`source bash_variables`) to activate the python env and export the PYTHONPATH and other slurm variables. Then run `00_start-pipeline.py` from the login node; it will run the rest of the preprocessing pipeline automatically by serially sbatching jobs up through SNP calling and filtering. If the user has chosen `pipeline-finish` as an email option, the pipeline will email the user once the pipeline is complete. If `fail` is chosen as an email option, the user will be emailed if any jobs die.
- Once the pipeline has finished, you will get an email (assuming you chose the `pipeline-finish` email notification) for each unique pool in the `pool_name` column of `datatable.txt`. Once all pools have finished, the next step is to run `98_get_read_stats.py` (see docstring at top of file for usage) to get summaries of read counts from pre-pipeline through trimming, mapping, and realignment. Next, run `99_bundle_files_for_transfer.py` (see docstring at top of file for usage) to bundle all necessary files to transfer to a local server. This script will create a .txt file with rsync commands that the user can execute from the remote server, and can also create md5 files for .bam and varscan output.
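As a rough illustration of the rsync-command file described above, here is a minimal sketch; the format actually written by 99_bundle_files_for_transfer.py may differ, and the host and paths are placeholders:

```python
def write_rsync_commands(files, remote_host, dest_dir, outfile):
    """Write one rsync command per file; the user runs these commands
    from the receiving (local) server to pull files off the HPC."""
    with open(outfile, "w") as out:
        for f in files:
            out.write(f"rsync -az {remote_host}:{f} {dest_dir}\n")
    return outfile
```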
(py3) [user@host ~]$ python $HOME/pipeline/00_start-pipeline.py -p PARENTDIR [-e EMAIL] [-n EMAIL_OPTIONS [EMAIL_OPTIONS ...]] [-maf MAF] [--translate] [--rm_repeats] [--rm_paralogs] [-h]
required arguments:
-p PARENTDIR /path/to/directory/with/fastq.gz-files/
optional arguments:
-e EMAIL the email address you would like to have notifications
sent to
-n EMAIL_OPTIONS [EMAIL_OPTIONS ...]
the type(s) of email notifications you would like to
receive from the pipeline. Requires --email-address.
These options are used to fill out the #SBATCH flags.
Must be one (or multiple) of
['all', 'fail', 'begin', 'end', 'pipeline-finish']
(default: None)
-maf MAF At the end of the pipeline, VCF files will be filtered
for MAF. If the pipeline is run on a single
population/pool, the user can set MAF to 0.0 so as to
filter variants based on global allele frequency
across populations/pools at a later time. (if the
number of sample_names in a pool == 1 then default
maf=0; otherwise maf = 1/sum(ploidy column))
--translate Boolean: true if used, false otherwise. If a stitched
genome is used for mapping, this option will look for
a ref.order file in the same directory as the
ref.fasta - where ref is the basename of the ref.fasta
(without the .fasta). The pipeline will use this
.order file to translate mapped positions to
unstitched positions at the end of the pipeline while
filtering. Positions in .order file are assumed to be
1-based indexing. Assumes .order file has no header,
and is of the format (contig name from unstitched
genome, start/stop are positions in the stitched genome):
ref_scaffold<tab>contig_name<tab>start_pos<tab>stop_pos<tab>contig_length
(default: False)
--rm_repeats Boolean: true if used, false otherwise. If repeat
regions are available, remove SNPs that fall within
these regions from final SNP table and write to
a REPEATS table. This option will look for a .txt file
in the same directory as the ref.fasta. Assumes the
filename is of the form: ref_repeats.txt - where ref
is the basename of the ref.fasta (without the .fasta).
This file should have 1-based indexing and should be
located in the same directory as the reference. The
file should have a header ('CHROM', 'start', 'stop').
The CHROM column can be names in the reference (if
using unstitched reference), or names of contigs that
were stitched to form the reference. If using a
stitched genome, --translate is required. (default:
False)
--rm_paralogs Boolean: true if used, false otherwise. If candidate
sites have been isolated within the reference where
distinct gene copies (paralogs) map to the same
position (and thus create erroneous SNPs), remove any
SNPs that fall on these exact sites and write to a
PARALOGS file. The pipeline assumes this file is
located in the parentdir, and ends with
'_paralog_snps.txt'. This file is tab-delimited, and
must have a column called 'locus' that contains
hyphen-separated CHROM-POS sites for paralogs.
These sites should be found in the current ref.fa
being used to call SNPs (otherwise SNPs cannot be
filtered by these sites). (default: False)
-h, --help Show this help message and exit.
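As an illustration of the `--translate` `.order` format described above, here is a minimal sketch of translating a stitched-genome position back to its unstitched contig; the function and variable names are mine, not the pipeline's:

```python
def translate_pos(order_rows, scaffold, pos):
    """order_rows: (ref_scaffold, contig_name, start, stop, contig_length)
    tuples, 1-based, with start/stop in stitched-genome coordinates.
    Returns (contig_name, 1-based position on contig), or None."""
    for ref_scaffold, contig, start, stop, _length in order_rows:
        if ref_scaffold == scaffold and start <= pos <= stop:
            return contig, pos - start + 1
    return None

rows = [("scaf_1", "contig_7", 1, 500, 500),
        ("scaf_1", "contig_9", 501, 900, 400)]
print(translate_pos(rows, "scaf_1", 650))  # position 650 falls in contig_9
```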