This repository provides a Python package implementing part of the analyses presented in:
- Sarkar, A. K., Ward, L. D., & Kellis, M. (2016). Functional enrichments of disease variants across thousands of independent loci in eight diseases. bioRxiv. http://dx.doi.org/10.1101/048066
The R package is available from http://www.github.com/aksarkar/frea-R. The computational pipeline described in the text (which utilizes these packages) is available from http://www.github.com/aksarkar/frea-pipeline
pip install git+git://github.com:aksarkar/frea.git#egg=frea
The Python package requires:
- Python > 3.2
- numpy
- scipy
The design of the packages is based on several ideas, which are dependent on the characteristics of the compute environment they were developed in (Univa Grid Engine, relatively strict memory limits, but many compute nodes):
- Use independent Python processes to distribute work in massively parallel fashion across compute nodes (using mechanisms outside of Python such as GNU parallel)
- Use streaming algorithms wherever possible, building as few intermediate data structures as needed
- Invoke modules as scripts (python -m) for entry points wherever possible
- Use R to produce visualizations
The fundamental computations here are:
- Coerce the data into a standardized (internal) format
- Convert the internal format to UCSC BED for use downstream
- Lift over and impute UCSC BED (as needed)
The fundamental computations here are:
- Motivate the method by estimating rank-based correlations between replicate summary statistics
- Prune summary statistics according to linkage disequilibrium
- Estimate the heuristic p-value cutoff for taking forward in the analysis
- Calculate permutation-based significance of enrichment for every annotation
- Draw and annotate the plots
The fundamental computations here is pruning enriched GO terms according to the Gene Ontology. We parse the Open Biomedical Ontology format file to build an adjacency list representing the graph of GO relationships. The algorithm is straightforward: perform DFS on the ontology subgraph induced by the enriched terms, keeping the deepest nodes found (the fringe) at every point.
The algorithm is implemented in two coroutines: the DFS coroutine yields events (either starting from an unexplored node or moving along an edge) and the current node. If we start from an unexplored node, we add it to the fringe; if we move along an edge, we remove the node from the fringe (if it was present). At the end of the algorithm, the fringe contains nodes which were visited only once in the traversal and therefore the deepest (most specific) enriched GO terms.
The fundamental computations here are:
- Counting motif overlaps, possibly grouping by PWM/factor group/PWM cluster
- Compute significance of motif enrichments based on overlap counts
- Draw and annotate heatmaps