If you use or are inspired by code from this repo, please cite the related manuscripts and data:
- Zenodo (contains an archived release of this repository):
Lind B (2021) GitHub.com/brandonlind/cmh_test: preprint release (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.5083798
cmh_test.py uses ipcluster engines to parallelize Cochran-Mantel-Haenszel chi-squared tests on stratified contingency tables built from varscan_pipeline outfiles.
Each stratum is a population. Each population has a "case" pool and a "control" pool. Together, these case and control pools make the contingency table. Each contingency table is 2x2 - case and control x REF and ALT allele counts.
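The test on these stratified tables can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in cmh_test.py: the function name `cmh_test`, the tuple layout of each table, and the optional continuity correction are hypothetical choices made for the example. The p-value is taken from the chi-squared distribution with one degree of freedom.

```python
import math


def cmh_test(tables, correction=True):
    """Cochran-Mantel-Haenszel chi-squared test for a list of 2x2 tables.

    Each table is ((a, b), (c, d)): rows are the case and control pools,
    columns are REF and ALT allele counts. One table per population (stratum).
    Returns (statistic, pvalue).
    """
    diff_sum = 0.0  # sum over strata of observed minus expected for cell a
    var_sum = 0.0   # sum over strata of the hypergeometric variance of cell a
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var_sum == 0:
        return float("nan"), float("nan")
    # Assumption: the correction is applied only when |diff_sum| >= 0.5.
    yates = 0.5 if (correction and abs(diff_sum) >= 0.5) else 0.0
    stat = (abs(diff_sum) - yates) ** 2 / var_sum
    # Survival function of chi-squared with 1 degree of freedom.
    pvalue = math.erfc(math.sqrt(stat / 2.0))
    return stat, pvalue


# Two hypothetical populations, each with (case, control) x (REF, ALT) counts.
example_tables = [((15, 5), (5, 15)), ((15, 5), (5, 15))]
stat, pvalue = cmh_test(example_tables, correction=False)
print(stat, pvalue)  # the statistic is approximately 19.5 for these counts
```

Whether a continuity correction is appropriate for pooled allele counts is a design choice; the sketch exposes it as a parameter rather than assuming either answer.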
ALT and REF allele counts are calculated by multiplying the ploidy of the pool by either the ALT freq or (1-ALT_freq).
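As a small worked example of that arithmetic, the sketch below converts an ALT frequency and a pool ploidy into REF and ALT counts, then pairs a case pool with a control pool to form one 2x2 table. The function names and the example values are hypothetical. The ALT frequency is assumed to be a proportion between 0 and 1, and no rounding is applied because the text above does not say whether counts are rounded.

```python
def allele_counts(alt_freq, ploidy):
    """Return (REF count, ALT count) for one pool.

    alt_freq is assumed to be a proportion in [0, 1];
    ploidy is the number of chromosomes in the pool.
    """
    alt_count = ploidy * alt_freq
    ref_count = ploidy * (1 - alt_freq)
    return ref_count, alt_count


def contingency_table(case_freq, case_ploidy, control_freq, control_ploidy):
    """Build one 2x2 table: rows are case and control, columns are REF and ALT."""
    return (
        allele_counts(case_freq, case_ploidy),
        allele_counts(control_freq, control_ploidy),
    )


# Hypothetical population: both pools have ploidy 40.
table = contingency_table(0.25, 40, 0.5, 40)
print(table)  # ((30.0, 10.0), (20.0, 20.0))
```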
This code was written and tested with Python 3.7.6. Python 3.8 appeared to have issues with the parallelization implementation; these issues have not been addressed in the current version.
The module versions used can be mirrored with: pip install -r requirements.txt
usage: cmh_test.py [-h] -i INPUT -o OUTDIR --case CASE --control CONTROL -p
PLOIDYFILE -e ENGINES [--ipcluster-profile PROFILE]
optional arguments:
-h, --help show this help message and exit
--ipcluster-profile PROFILE
The ipcluster profile name with which to start engines. Default: 'default'
required arguments:
-i INPUT, --input INPUT
/path/to/VariantsToTable_output.txt
It is assumed that there is either a 'locus' or 'unstitched_locus' column.
The 'locus' column elements are the hyphen-separated
CHROM-POS. If the 'unstitched_chrom' column is present, the code will use the
'unstitched_locus' column for SNP names, otherwise 'locus'. The
'unstitched_locus' elements are therefore the hyphen-separated
unstitched_chrom-unstitched_pos. FREQ columns from VarScan are also
assumed.
-o OUTDIR, --outdir OUTDIR
/path/to/cmh_test_output_dir/
File output from cmh_test.py will be saved in the outdir, with the original
name of the input file, but with the suffix "_CMH-test-results.txt"
--case CASE The string present in every column for pools in "case" treatments.
--control CONTROL The string present in every column for pools in "control" treatments.
-p PLOIDYFILE, --ploidy PLOIDYFILE
/path/to/the/ploidy.pkl file output by the VarScan pipeline. This is a python
dictionary with key=pool_name, value=dict with key=pop, value=ploidy. The code
will prompt for pool_name if necessary.
-e ENGINES, --engines ENGINES
The number of ipcluster engines that will be launched.
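To make the input conventions in the help text concrete, the following sketch shows one way the locus column, the case and control FREQ columns, and the ploidy lookup could be resolved. It is an illustration only, not the script's implementation: the helper names, the column names, the pool and population names, and the assumption that FREQ column names end in "FREQ" are all hypothetical.

```python
def pick_locus_column(columns):
    """Use 'unstitched_locus' when 'unstitched_chrom' is present, else 'locus'."""
    return "unstitched_locus" if "unstitched_chrom" in columns else "locus"


def pick_freq_columns(columns, tag):
    """Return FREQ columns whose name contains the --case or --control string.

    Assumption: FREQ column names end with 'FREQ'.
    """
    return [col for col in columns if tag in col and col.endswith("FREQ")]


# Hypothetical header from a VariantsToTable output file.
columns = [
    "CHROM", "POS", "locus",
    "popA_case.FREQ", "popA_control.FREQ",
    "popB_case.FREQ", "popB_control.FREQ",
]

# Hypothetical ploidy dictionary: key=pool_name, value=dict of pop -> ploidy.
ploidy = {
    "my_pool": {
        "popA_case": 40, "popA_control": 40,
        "popB_case": 36, "popB_control": 38,
    }
}

locus_col = pick_locus_column(columns)
case_cols = pick_freq_columns(columns, "case")
control_cols = pick_freq_columns(columns, "control")

# Look up the ploidy for each case column by stripping the assumed '.FREQ' suffix.
case_ploidy = [ploidy["my_pool"][col[: -len(".FREQ")]] for col in case_cols]
print(locus_col, case_cols, control_cols, case_ploidy)
```

The case and control strings need to be distinct enough that neither matches the other group's columns; in this example "case" and "control" do not overlap.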