parEBEN - Parallel Implementations of the Empirical Bayesian Elastic Net Cross-Validation in R

Colby T. Ford, Ph.D.

Abstract

The Empirical Bayesian Elastic Net (EBEN) algorithm was developed by Huang et al. for handling multicollinearity in generalized linear regression models. Historically, it has been used in the analysis of quantitative trait loci (QTLs) and gene-gene interactions (epistasis). In addition to the algorithm, the group also created the EBEN package for R. This package includes functions to generate the elastic nets for both binomial and Gaussian priors. These functions are efficient and do not require large amounts of computational time. However, the package also includes functions for the cross-validation of those models. While essential, this step is considerably more expensive. The cross-validation functions perform a sweep to determine hyperparameters and minimize prediction error. More specifically, an n-fold cross-validation sweep is performed to minimize error by trying combinations of two parameters (α and λ) in a stepped manner. Experimentally, it has been shown that this can take a long time, especially on larger datasets (as seen in genomics problems).

To address this, the cross-validation functions were parallelized using parallel packages in R. By distributing the iterations of the cross-validation over multiple CPU cores or multiple machines of a computing cluster, a drastic time reduction can be seen with no negative effect on the resulting EBEN models. With the computation time reduced, regression models on larger, more complex data can be completed without such a delay. This also opens the door for larger datasets to be analyzed, rather than limiting the research because of time and computing resource constraints. Thus, parallelizing the cross-validation of the EBEN models should prove greatly beneficial in future research using cross-validated Bayesian elastic nets.
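The sweep described above can be sketched in a few lines of base R. This is only an illustration of the idea, not the parEBEN implementation: it assumes an ordinary (non-Bayesian) elastic net fitted by coordinate descent as a stand-in for the EBEN model, uses the base `parallel` package so it runs without extra dependencies, and all function names, data, and grid values are hypothetical.

```r
library(parallel)  # ships with base R

# Stand-in model: ordinary elastic net via coordinate descent.
# This is NOT the EBEN algorithm; it only gives the sweep something to fit.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

fit_enet <- function(X, y, alpha, lambda, n_iter = 100) {
  n <- nrow(X)
  beta <- numeric(ncol(X))
  col_ss <- colSums(X^2) / n
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # partial residual that leaves out feature j
      r_j <- y - as.vector(X[, -j, drop = FALSE] %*% beta[-j])
      z <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(z, lambda * alpha) /
        (col_ss[j] + lambda * (1 - alpha))
    }
  }
  beta
}

# Mean squared prediction error of one (alpha, lambda) pair over all folds
cv_error <- function(alpha, lambda, X, y, fold_id) {
  errs <- sapply(sort(unique(fold_id)), function(k) {
    test <- fold_id == k
    beta <- fit_enet(X[!test, , drop = FALSE], y[!test], alpha, lambda)
    mean((y[test] - as.vector(X[test, , drop = FALSE] %*% beta))^2)
  })
  mean(errs)
}

# Toy data (hypothetical): 60 samples, 10 features, 2 of them informative
set.seed(1)
n <- 60
p <- 10
feats <- scale(matrix(rnorm(n * p), n, p))
resp <- as.vector(feats[, 1:2] %*% c(2, -1) + rnorm(n, sd = 0.5))
resp <- resp - mean(resp)
folds <- sample(rep(1:3, length.out = n))  # 3-fold assignment

# Stepped grid of hyperparameter combinations to try
pars <- expand.grid(alpha = c(0.1, 0.5, 0.9), lambda = c(0.01, 0.1, 1))

# Each grid point is independent, so the sweep is spread over the workers
cl <- makeCluster(2)
clusterExport(cl, c("soft_threshold", "fit_enet", "cv_error"),
              envir = environment())
pars$cv_mse <- parSapply(
  cl, seq_len(nrow(pars)),
  function(i, pars, feats, resp, folds) {
    cv_error(pars$alpha[i], pars$lambda[i], feats, resp, folds)
  },
  pars = pars, feats = feats, resp = resp, folds = folds
)
stopCluster(cl)

best <- pars[which.min(pars$cv_mse), ]
print(best)
```

Because every (α, λ) combination is evaluated independently on the same fold assignment, running the grid in parallel gives the same errors as running it serially; only the wall-clock time changes.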

Time Reduction Benchmark

To interactively view cross-validation time benchmarks between parEBEN and the original EBEN package, click here

Installation

You can install the latest stable version from GitHub using the following command:

library(devtools)
install_github("colbyford/parEBEN")
library(parEBEN)

Usage

First, select the parallelization method you wish to use. Currently, all foreach backends are supported, such as doParallel, doMPI, and doSNOW.

Initialize The Cluster

Note: Refer to the manual for your desired foreach parallelization package as the initialization may differ between methods.

Local Parallel

library(doParallel)
no_cores <- detectCores()
cl <- makeCluster(no_cores)
#clusterExport(cl, c("CrossValidate"))
registerDoParallel(cl)
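
Before starting a long cross-validation it can help to confirm that the backend is actually registered, and to release the workers once the run is finished. The following is a minimal sketch, assuming the doParallel backend; the worker count of 2 and the toy loop are placeholders for illustration only.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)          # small fixed worker count, for illustration only
registerDoParallel(cl)

getDoParName()                # name of the currently registered backend
getDoParWorkers()             # number of workers foreach will use

# Any %dopar% loop now runs on the registered workers
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)               # release the worker processes
registerDoSEQ()               # fall back to sequential foreach afterwards
```

Calling `registerDoSEQ()` after `stopCluster(cl)` avoids leaving foreach pointed at a cluster that no longer exists.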

Cluster Distribution

library(doMPI)
# create and register a doMPI cluster if necessary
if (!identical(getDoParName(), 'doMPI')) {
  # set count to (cores_requested-1)
  cl <- startMPIcluster(count=255,verbose=TRUE)
  registerDoMPI(cl)
}

Microsoft Machine Learning Server Distribution

## Set your compute context as Spark, local parallel, MapReduce, etc.
### See: https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxspark
### Sample Code: https://gist.github.com/premalxyz/e97ae7823052b7a426cb816830c0188c#file-spark_compute_context-r

mySparkCluster <- RxSpark(ClusterInfo)
rxSetComputeContext(mySparkCluster)

## Register the context using doRSR
library(doRSR)
registerDoRSR()

Begin the Cross-Validation

## Load in data and required EBEN and parEBEN packages
library(EBEN)
library(parEBEN)

## Create a small sample matrix for testing
data(BASIS)
data(y)
n <- 50
k <- 100
BASIS <- BASIS[1:n, 1:k]
y <- y[1:n]

parEBENcv <- CrossValidate(BASIS,
                           y,
                           nFolds = 3,
                           Epis = "no",
                           prior = "gaussian",
                           search = "global"
                           )

## Use the optimal values in the EBEN model
EBENoutput <- EBelasticNet.Gaussian(BASIS,
                                    y,
                                    lambda = parEBENcv$lambda.optimal,
                                    alpha = parEBENcv$alpha.optimal,
                                    Epis = "no",
                                    verbose = 1)

To Do List

Publication and How To Cite

Publication: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa216/5813727

Data and materials used in publication can be found here.

Jia Wen, Colby T Ford, Daniel Janies, Xinghua Shi, A Parallelized Strategy for Epistasis Analysis Based on Empirical Bayesian Elastic Net Models, Bioinformatics, 2020, btaa216, https://doi.org/10.1093/bioinformatics/btaa216

or using BibTeX...

@article{10.1093/bioinformatics/btaa216,
    author = {Wen, Jia and Ford, Colby T and Janies, Daniel and Shi, Xinghua},
    title = "{A Parallelized Strategy for Epistasis Analysis Based on Empirical Bayesian Elastic Net Models}",
    journal = {Bioinformatics},
    year = {2020},
    month = {03},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa216},
    url = {https://doi.org/10.1093/bioinformatics/btaa216},
    note = {btaa216},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa216/32981158/btaa216.pdf},
}

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Funding

This project was funded in part by NIH R15HG009565.
