CoAdapTree/pypoolation
pypoolation

Uses ipcluster engines to parallelize calculations from varscan_pipeline output files, reporting Tajima's D, Tajima's pi, and/or Watterson's theta over user-defined windows (as in VarianceExactCorrection.pm from PoPoolation). Alternatively, this code is likely to work with any .txt file produced by sending a VarScan VCF to GATK's VariantsToTable.
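For orientation on what is being reported per window: the pooled exact corrections in VarianceExactCorrection.pm are involved, but the classic (uncorrected) forms of the three statistics can be sketched in Python as below. The function names and the toy minor-allele counts are illustrative only and are not pypoolation's implementation.

```python
from math import sqrt

def tajima_pi(allele_counts, n):
    """Average pairwise differences across sites in a window.
    allele_counts: minor-allele count per SNP; n: number of sequences sampled."""
    pairs = n * (n - 1) / 2
    return sum(c * (n - c) / pairs for c in allele_counts)

def watterson_theta(num_snps, n):
    """Watterson's theta: segregating sites scaled by the harmonic number a1."""
    a1 = sum(1 / i for i in range(1, n))
    return num_snps / a1

def tajima_d(allele_counts, n):
    """Classic Tajima's D: (pi - theta) normalized by its estimated variance."""
    S = len(allele_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    return (tajima_pi(allele_counts, n) - watterson_theta(S, n)) / sqrt(variance)
```

For pool-seq data these textbook formulas are biased, which is why pypoolation follows the exact-correction approach of PoPoolation rather than the sketch above.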

Assumed environment

This code was written and tested with Python 3.7.6. Python 3.8 appeared to have issues with the parallelization implementation; this has not been addressed in the current version.

Module versions used can be mirrored with pip install -r requirements.txt

Example

The following is an example of environment setup and pypoolation execution (steps to git clone the repo into /data/programs and to install Anaconda are not shown). If any errors are encountered when running pypoolation, execute ipcluster stop before re-running the analysis. Creating the environment and installing requirements.txt only needs to be done once; afterwards, activating the environment and exporting the PYTHONPATH is sufficient to run pypoolation.

# create new environment
conda create --name py37 python=3.7

# activate env
conda activate py37

# install reqs
cd /data/programs/pypoolation
pip install -r requirements.txt

# export pythonpath
export PYTHONPATH="${PYTHONPATH}:/data/programs/pypoolation"

# run pypoolation to calculate pi for pool named LPp19-2S_R1 using 20 CPUs
# (instantiation of INPUT, OUTDIR and PLOIDYFILE not shown)
python pypoolation.py -i INPUT \
-o OUTDIR \
-p PLOIDYFILE \
-e 20 \
-m pi \
--which-pops LPp19-2S_R1
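The INPUT file above is a tab-delimited VariantsToTable-style export carrying a 'CHROM' (or 'unstitched_chrom') column plus the VarScan DP, AD, and RD columns, with SNP names in a hyphen-separated 'locus' column. A minimal sketch of deriving that locus column, using a hypothetical two-SNP table (the scaffold names and depths here are made up):

```python
import csv
import io

# Hypothetical VariantsToTable-style export: one row per SNP with the
# VarScan DP (total depth), AD (alt depth), and RD (ref depth) columns.
raw = io.StringIO(
    "CHROM\tPOS\tDP\tAD\tRD\n"
    "scaffold_1\t101\t25\t6\t19\n"
    "scaffold_1\t160\t30\t12\t18\n"
)
rows = list(csv.DictReader(raw, delimiter="\t"))

# pypoolation expects SNP names as hyphen-separated CHROM-POS in a 'locus'
# column; derive it when the table only carries CHROM and POS.
for row in rows:
    row["locus"] = f"{row['CHROM']}-{row['POS']}"
```

For stitched assemblies the same derivation would apply to the 'unstitched_chrom' and 'unstitched_locus' columns instead.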

Usage

usage: pypoolation.py [-h] -i INPUT -o OUTDIR -m MEASURE -p PLOIDYFILE -e
                      ENGINES [--ipcluster-profile PROFILE]
                      [--which-pops WHICHPOPS [WHICHPOPS ...]]
                      [--min-count MINCOUNT] [--min-coverage MINCOV]
                      [--max-coverage MAXCOV]
                      [--min-coverage-fraction MINCOVFRACTION]
                      [--window-size WINDOWSIZE] [--chromfile CHROMFILE]


required arguments:
  -i INPUT, --input INPUT
                        /path/to/VariantsToTable_output.txt
                        It is assumed that there is either a 'CHROM' or 'unstitched_chrom'
                        column (ref.fa record name), and either a 'locus' or 'unstitched_locus' column.
                        The 'locus' column elements are the hyphen-separated
                        CHROM-POS. If the 'unstitched_chrom' column is present, the code will use the
                        'unstitched_locus' column for SNP names, otherwise 'CHROM' and 'locus'. The
                        'unstitched_locus' elements are therefore the hyphen-separated
                        unstitched_chrom-unstitched_pos. DP, AD, and RD columns from VarScan are also
                        assumed.
  -o OUTDIR, --outdir OUTDIR
                        /path/to/pypoolation_output_dir/
  -m MEASURE, --measure MEASURE
                        pi OR theta OR D.
  -p PLOIDYFILE, --ploidy PLOIDYFILE
                        /path/to/the/ploidy.pkl file output by the VarScan pipeline. This is a python
                        dictionary with key=pool_name, value=dict with key=pop, value=ploidy. The code
                        will prompt for pool_name. pop(s) can be input with --which-pops.
  -e ENGINES, --engines ENGINES
                        The number of ipcluster engines that will be launched.


optional arguments:
  -h, --help            show this help message and exit
  --ipcluster-profile PROFILE
                        The ipcluster profile name with which to start engines. Default: 'default'
  --which-pops WHICHPOPS [WHICHPOPS ...]
                        Specific populations for which statistics should be
                        calculated, entered as a space-separated list. Default=all
  --min-count MINCOUNT  The minimum count of the minor allele. Default=2
  --min-coverage MINCOV
                        The minimum coverage of a site. Sites with a lower coverage
                        will not be considered. Default=8
  --max-coverage MAXCOV
                        The maximum coverage of a site. Sites with greater values
                        will not be considered. Default=1000
  --min-coverage-fraction MINCOVFRACTION
                        The minimum fraction of a window having SNPs between min-coverage
                        and max-coverage (fraction = num snps / windowsize). Default=0.0
  --window-size WINDOWSIZE
                        The size of the sliding window. Windows are centered on
                        individual SNPs, so any SNP within 0.5*windowsize bps
                        from the current SNP will be included in the window. For the
                        CoAdapTree data, this strategy was most computationally
                        efficient. Data was further filtered after output. Default=1000bp
  --chromfile CHROMFILE
                        Path to a .txt file listing, one per line, the
                        chromosome/contig names to evaluate. All SNPs on these
                        chromosomes/contigs will be evaluated. It is assumed that
                        these chromosomes are in the "CHROM" or "unstitched_chrom"
                        column of the input file. If 'unstitched_chrom' column
                        is present in input file, then this is the default column
                        to use to parallelize input, otherwise CHROM is used.
                        Default=all.
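The -p/--ploidy argument expects a pickled nested dictionary keyed by pool name, then by pop. The real file is produced by the VarScan pipeline, but a compatible file can be sketched with the standard pickle module. The pop name and ploidy value below are invented for illustration; the pool name reuses the one from the example run above.

```python
import pickle

# Hypothetical contents: key=pool_name, value=dict with key=pop, value=ploidy.
# "LPp19-2S" and 40 are made-up placeholders, not values from the pipeline.
ploidy = {"LPp19-2S_R1": {"LPp19-2S": 40}}

with open("ploidy.pkl", "wb") as f:
    pickle.dump(ploidy, f)

# Round-trip to confirm the structure survives pickling.
with open("ploidy.pkl", "rb") as f:
    loaded = pickle.load(f)
```

pypoolation prompts for the pool_name key, then looks up the ploidy of each pop requested via --which-pops.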

About

Python scripts for population genetic statistics on data coming out of the VarScan pipeline.
