GitHub - ammodramus/ebsom: Empirical Bayes somatic variant calling

This repository implements a novel empirical Bayesian model for low-frequency SNP calling with high-coverage pooled-sample/high-ploidy/somatic DNA sequencing data. This prototype works well on simulated data but is not currently working on the real data for which it was designed. Use with caution.

The basic approach is to jointly model the DNA sequencing error process and the global allele-frequency distribution, obtaining a maximum-likelihood estimate of each. The inferred global distribution of allele frequencies is employed as the prior distribution in calculating the posterior distribution of allele frequencies at individual candidate loci, making this an empirical Bayesian approach.

The sequencing error model is a fully-connected artificial neural network, where for each base-call the inputs are a number of covariates associated with that base-call, including the true base, base quality, mapping quality, read number, position along the read, and potential for contamination at the site (allowing for index-swapping of multiplexed libraries).

For a more concrete (but still brief) mathematical description, see README.pdf.

Installation

This module requires the following Python packages:

Cython is also required.

Once dependencies are installed, clone this directory and run python setup.py build_ext -i.

Usage

There are three steps to running this model and calculating the posterior distribution of allele frequencies.

Collect the covariate data from the .BAM alignments:
```
python collect_data.py bamlist.txt chromlist.txt data.h5
```
Here bamlist.txt is a file containing a list of .BAM alignments from the same sequencing lane, chromlist.txt is a file containing a list of chromosomes/contigs to analyze (or the name of a single chromosome/contig), and output.h5 is the name of the HDF5 data file to be created. See python collect_data.py -h for details and more options.
Optimize the global model:
```
python run_model.py data.h5 --save-model error.model
```
data.h5 is the data file produced in the first step, and error.model is the filepath (a directory, on POSIX systems) in which the optimized model parameters will be saved. See python run_model.py -h for options that can be used to control the optimization.
Calculate and output the (log-) posterior distribution of allele frequencies at each site:
```
python calculate_posterior_probabilities.py data.h5 error.model
```
data.h5 and error.model are as in the previous two steps. This prints to STDOUT a tab-separated table of the most-likely frequency (i.e., the MAP estimate) and the entire log-posterior distribution of the discrete allele frequencies for each alignment file, chromosome, and position. The discrete allele frequencies are output in a '#'-prefixed comment line.

Name		Name	Last commit message	Last commit date
Latest commit History 309 Commits
README.ipynb		README.ipynb
README.md		README.md
README.pdf		README.pdf
beta_with_spikes_integrated.py		beta_with_spikes_integrated.py
calculate_posterior_probabilities.py		calculate_posterior_probabilities.py
collect_data.py		collect_data.py
cyglobal.pyx		cyglobal.pyx
cylocobs.pxd		cylocobs.pxd
cylocobs.pyx		cylocobs.pyx
cyprocessreads.pyx		cyprocessreads.pyx
cyregcov.pxd		cyregcov.pxd
cyregcov.pyx		cyregcov.pyx
cyrowmaker.pxd		cyrowmaker.pxd
cyrowmaker.pyx		cyrowmaker.pyx
cyutil.pxd		cyutil.pxd
cyutil.pyx		cyutil.pyx
h5py_util.py		h5py_util.py
likelihood_layer.py		likelihood_layer.py
run_model.py		run_model.py
setup.py		setup.py
simulate_data.py		simulate_data.py
util.py		util.py

ammodramus/ebsom

Folders and files

Latest commit

History

Repository files navigation

Installation

Usage

About

Topics

Resources

Stars

Watchers

Forks

Languages