pypoolation

Using ipcluster engines to parallelize calculations from varscan_pipeline outfiles, report Tajimas'D and/or Tajima's pi and/or Watterson's theta over user-defined windows (as in VarianceExactCorrection.pm from popoolation). Alternatively, this code is likely to work relatively well with any txt files output by sending a VarScan vcf output to GATK's VariantsToTable.

Assumed environment

This code was written and tested with python 3.7.6. It seemed that python3.8 had issues with parallelization implementation; this issue was not addressed in current version.

Module versions used can be mirrored with pip install -r requirements.txt

example

The following is an example of environment setup and pypoolation execution (steps to git clone repo into /data/programs, and installing anaconda are not shown). If any errors are encountered when running pypoolation, make sure to execute ipcluster stop before re-running analysis. Creating an environment and installing requirements.txt only needs to be done once, afterwards activating the environment + exporting the PYTHONPATH is sufficient to run pypoolation.

# create new environment
conda create --name py37 python=3.7

# activate env
conda activate py37

# install reqs
cd /data/programs/pypoolation
pip install -r requirements.txt

# export pythonpath
export PYTHONPATH="${PYTHONPATH}:/data/programs/pypoolation"

# run pypoolation to calculate pi for pool named LPp19-2S_R1 using 20 CPUs
# (instantiation of INPUT, OUTDIR and PLOIDYFILE not shown)
python pypoolation.py -i INPUT \
-o OUTDIR \
-p PLOIDYFILE \
-e 20 \
-m pi \
--which-pops LPp19-2S_R1

Usage

usage: pypoolation.py [-h] -i INPUT -o OUTDIR -m MEASURE -p PLOIDYFILE -e
                      ENGINES [--ipcluster-profile PROFILE]
                      [--which-pops WHICHPOPS [WHICHPOPS ...]]
                      [--min-count MINCOUNT] [--min-coverage MINCOV]
                      [--max-coverage MAXCOV]
                      [--min-coverage-fraction MINCOVFRACTION]
                      [--window-size WINDOWSIZE] [--chromfile CHROMFILE]


required arguments:
  -i INPUT, --input INPUT
                        /path/to/VariantsToTable_output.txt
                        It is assumed that there is either a 'CHROM' or 'unstitched_chrom'
                        column (ref.fa record name), and either a 'locus' or 'unstitched_locus' column.
                        The 'locus' column elements are the hyphen-separated
                        CHROM-POS. If the 'unstitched_chrom' column is present, the code will use the
                        'unstitched_locus' column for SNP names, otherwise 'CHROM' and 'locus'. The
                        'unstitched_locus' elements are therefore the hyphen-separated 
                        unstitched_locus-unstitched_pos. DP, AD, and RD columns from VarScan are also 
                        assumed.
  -o OUTDIR, --outdir OUTDIR
                        /path/to/pypoolation_output_dir/
  -m MEASURE, --measure MEASURE
                        pi OR theta OR D.
  -p PLOIDYFILE, --ploidy PLOIDYFILE
                        /path/to/the/ploidy.pkl file output by the VarScan pipeline. This is a python
                        dictionary with key=pool_name, value=dict with key=pop, value=ploidy. The code
                        will prompt for pool_name. pop(s) can be input with --whichpops.
  -e ENGINES, --engines ENGINES
                        The number of ipcluster engines that will be launched.


optional arguments:
  -h, --help            show this help message and exit
  --ipcluster-profile PROFILE
                        The ipcluster profile name with which to start engines. Default: 'default'
  --which-pops WHICHPOPS [WHICHPOPS ...]
                        Specific populations you would like statistics calculated - entered as a space separated list. Default=all
  --min-count MINCOUNT  The minimum count of the minor allele. Default=2
  --min-coverage MINCOV
                        The minimum coverage of a site. Sites with a lower coverage
                        will not be considered. Default=8
  --max-coverage MAXCOV
                        The maximum coverage of a site. Sites with greater values
                        will not be considered. Default=1000
  --min-coverage-fraction MINCOVFRACTION
                        The minimum fraction of a window having SNPs between min-coverage
                        and max-coverage (fraction = num snps / windowsize). Default=0.0
  --window-size WINDOWSIZE
                        The size of the sliding window. Windows are centered on
                        individual SNPs, so any SNP within 0.5*windowsize bps
                        from the current SNP will be included in the window. For the
                        CoAdapTree data, this strategy was most computationally
                        efficient. Data was further filtered after output. Default=1000bp
  --chromfile CHROMFILE
                        Path to .txt file with each line a chromosome/contig name
                        that is wished to be evaluated. All SNPs on this
                        chromosome/contig will be evaluated. It is assumed that
                        these chromosomes are in the "CHROM" or "unstitched_chrom"
                        column of the input file. If 'unstitched_chrom' column
                        is present in input file, then this is the default column
                        to use to parallelize input, otherwise CHROM is used.
                        Default to analyze=All.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.gitmodules		.gitmodules
README.md		README.md
pypoolation.py		pypoolation.py
requirements.txt		requirements.txt
varmath.py		varmath.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitmodules

.gitmodules

README.md

README.md

pypoolation.py

pypoolation.py

requirements.txt

requirements.txt

varmath.py

varmath.py

Repository files navigation

pypoolation

Assumed environment

example

Usage

About

Releases

Packages

Contributors 2

Languages

CoAdapTree/pypoolation

Folders and files

Latest commit

History

Repository files navigation

pypoolation

Assumed environment

example

Usage

About

Resources

Stars

Watchers

Forks

Languages