Using ipcluster engines to parallelize calculations from varscan_pipeline outfiles, report Tajimas'D and/or Tajima's pi and/or Watterson's theta over user-defined windows (as in VarianceExactCorrection.pm from popoolation). Alternatively, this code is likely to work relatively well with any txt files output by sending a VarScan vcf output to GATK's VariantsToTable.
This code was written and tested with python 3.7.6. It seemed that python3.8 had issues with parallelization implementation; this issue was not addressed in current version.
Module versions used can be mirrored with pip install -r requirements.txt
The following is an example of environment setup and pypoolation execution (steps to git clone
repo into /data/programs
, and installing anaconda are not shown). If any errors are encountered when running pypoolation, make sure to execute ipcluster stop
before re-running analysis. Creating an environment and installing requirements.txt only needs to be done once, afterwards activating the environment + exporting the PYTHONPATH is sufficient to run pypoolation.
# create new environment
conda create --name py37 python=3.7
# activate env
conda activate py37
# install reqs
cd /data/programs/pypoolation
pip install -r requirements.txt
# export pythonpath
export PYTHONPATH="${PYTHONPATH}:/data/programs/pypoolation"
# run pypoolation to calculate pi for pool named LPp19-2S_R1 using 20 CPUs
# (instantiation of INPUT, OUTDIR and PLOIDYFILE not shown)
python pypoolation.py -i INPUT \
-o OUTDIR \
-p PLOIDYFILE \
-e 20 \
-m pi \
--which-pops LPp19-2S_R1
usage: pypoolation.py [-h] -i INPUT -o OUTDIR -m MEASURE -p PLOIDYFILE -e
ENGINES [--ipcluster-profile PROFILE]
[--which-pops WHICHPOPS [WHICHPOPS ...]]
[--min-count MINCOUNT] [--min-coverage MINCOV]
[--max-coverage MAXCOV]
[--min-coverage-fraction MINCOVFRACTION]
[--window-size WINDOWSIZE] [--chromfile CHROMFILE]
required arguments:
-i INPUT, --input INPUT
/path/to/VariantsToTable_output.txt
It is assumed that there is either a 'CHROM' or 'unstitched_chrom'
column (ref.fa record name), and either a 'locus' or 'unstitched_locus' column.
The 'locus' column elements are the hyphen-separated
CHROM-POS. If the 'unstitched_chrom' column is present, the code will use the
'unstitched_locus' column for SNP names, otherwise 'CHROM' and 'locus'. The
'unstitched_locus' elements are therefore the hyphen-separated
unstitched_locus-unstitched_pos. DP, AD, and RD columns from VarScan are also
assumed.
-o OUTDIR, --outdir OUTDIR
/path/to/pypoolation_output_dir/
-m MEASURE, --measure MEASURE
pi OR theta OR D.
-p PLOIDYFILE, --ploidy PLOIDYFILE
/path/to/the/ploidy.pkl file output by the VarScan pipeline. This is a python
dictionary with key=pool_name, value=dict with key=pop, value=ploidy. The code
will prompt for pool_name. pop(s) can be input with --whichpops.
-e ENGINES, --engines ENGINES
The number of ipcluster engines that will be launched.
optional arguments:
-h, --help show this help message and exit
--ipcluster-profile PROFILE
The ipcluster profile name with which to start engines. Default: 'default'
--which-pops WHICHPOPS [WHICHPOPS ...]
Specific populations you would like statistics calculated - entered as a space separated list. Default=all
--min-count MINCOUNT The minimum count of the minor allele. Default=2
--min-coverage MINCOV
The minimum coverage of a site. Sites with a lower coverage
will not be considered. Default=8
--max-coverage MAXCOV
The maximum coverage of a site. Sites with greater values
will not be considered. Default=1000
--min-coverage-fraction MINCOVFRACTION
The minimum fraction of a window having SNPs between min-coverage
and max-coverage (fraction = num snps / windowsize). Default=0.0
--window-size WINDOWSIZE
The size of the sliding window. Windows are centered on
individual SNPs, so any SNP within 0.5*windowsize bps
from the current SNP will be included in the window. For the
CoAdapTree data, this strategy was most computationally
efficient. Data was further filtered after output. Default=1000bp
--chromfile CHROMFILE
Path to .txt file with each line a chromosome/contig name
that is wished to be evaluated. All SNPs on this
chromosome/contig will be evaluated. It is assumed that
these chromosomes are in the "CHROM" or "unstitched_chrom"
column of the input file. If 'unstitched_chrom' column
is present in input file, then this is the default column
to use to parallelize input, otherwise CHROM is used.
Default to analyze=All.