diff --git a/.Rbuildignore b/.Rbuildignore index 112ad26..dbe9f0f 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -1,3 +1,6 @@ -^.*\.Rproj$ +### For some reason Hadley's regexes from http://r-pkgs.had.co.nz/package.html are being ignored, so do this manually +^cran-comments\.md$ +^README\.Rmd$ +^sparsebn\.Rproj$ ^\.Rproj\.user$ ^\.travis\.yml$ diff --git a/.travis.yml b/.travis.yml index 5a540bf..020bf80 100644 --- a/.travis.yml +++ b/.travis.yml @@ -4,6 +4,15 @@ language: R sudo: false cache: packages +#r_build_args: --no-build-vignettes --no-manual --no-resave-data +#r_check_args: --no-build-vignettes --no-manual + +# Test master and dev branches +branches: + only: + - master + - dev + # Test on release, old, and development versions of R r: - oldrel diff --git a/DESCRIPTION b/DESCRIPTION index 4588297..6e55b46 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,10 +1,10 @@ Package: sparsebn Title: Learning Sparse Bayesian Networks from High-Dimensional Data -Version: 0.0.0.9000 -Date: 2016-04-05 +Version: 0.0.1 +Date: 2016-08-12 Authors@R: person("Bryon", "Aragam", email = "sparsebn@gmail.com", role = c("aut", "cre")) Maintainer: Bryon Aragam <sparsebn@gmail.com> -Description: Methods and tools for learning sparse Bayesian networks from high-dimensional data using sparse regularization. +Description: Fast methods for learning sparse Bayesian networks from high-dimensional data using sparse regularization. Designed to incorporate mixed experimental and observational data with thousands of variables with either continuous or discrete observations. Depends: R (>= 3.2.3), sparsebnUtils, @@ -15,6 +15,7 @@ Suggests: rmarkdown, mvtnorm, igraph, + graph, testthat URL: https://github.com/itsrainingdata/sparsebn BugReports: https://github.com/itsrainingdata/sparsebn/issues diff --git a/R/data.R b/R/data.R index 6fc00c7..4081d02 100644 --- a/R/data.R +++ b/R/data.R @@ -1,18 +1,49 @@ #' The pathfinder network #' -#' The pathfinder network from the \href{http://www.bnlearn.com/bnrepository/#pathfinder}{Bayesian network repository}. -#' Pathfinder is an expert system developed in 1992 by Heckerman et. al +#' Simulated data and network for the pathfinder network from the \href{http://www.bnlearn.com/bnrepository/#pathfinder}{Bayesian network repository}. +#' Pathfinder is an expert system developed by Heckerman et al. (1992) #' to assist with the diagnosis of lymph-node diseases. #' +#' The data is simulated from a Gaussian SEM assuming unit edge weights and +#' unit variances for all nodes. +#' +#' @format A \code{\link{list}} with four components: +#' +#' \itemize{ +#' \item \code{dag} An \code{\link[sparsebnUtils]{edgeList}} containing the pathfinder network (109 nodes, 195 edges). +#' \item \code{data} A \code{\link{data.frame}} with 109 variables and 1000 observations. +#' \item \code{ivn} A \code{\link{list}} specifying which nodes are under intervention in each observation; since this dataset +#' is purely observational, this is just \code{NULL}. Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. +#' \item \code{cov} Covariance matrix used to generate the data. +#' } +#' #' @format An \code{\link{edgeList}} object with 109 nodes and 195 edges. 
#' +#' @usage +#' data(pathfinder) +#' +#' @examples +#' \dontrun{ +#' # Create a valid sparsebnData object from the simulated pathfinder data +#' data(pathfinder) +#' dat <- sparsebnData(pathfinder$data, type = "c") +#' +#' # If desired, change the edge weights to be randomly generated +#' coefs <- get.adjacency.matrix(pathfinder$dag) +#' coefs[coefs != 0] <- runif(n = num.edges(pathfinder$dag), min = 0.5, max = 2) +#' vars <- Matrix::Diagonal(n = num.nodes(pathfinder$dag), x = rep(1, num.nodes(pathfinder$dag))) +#' id <- vars +#' covMat <- t(solve(id - coefs)) %*% vars %*% solve(id - coefs) +#' pathfinder.data <- mvtnorm::rmvnorm(n = 1000, sigma = as.matrix(covMat)) +#' } +#' "pathfinder" -#' The cytometry network +#' The discrete cytometry network #' #' Data and network for analyzing the flow cytometry experiment #' from \href{http://science.sciencemag.org/content/308/5721/523.long}{Sachs et al. (2005)}. -#' The data is a cleaned and discretized version of the raw data from these experiments. +#' The data is a cleaned and discretized version of the raw data (see \code{\link{cytometryContinuous}}) from these experiments. #' #' @format A \code{\link{list}} with three components: #' @@ -23,4 +54,37 @@ #' Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. #' } #' +#' @usage +#' data(cytometryDiscrete) +#' +#' @examples +#' # Create a valid sparsebnData object from the cytometry data +#' data(cytometryDiscrete) +#' dat <- sparsebnData(cytometryDiscrete$data, type = "d", ivn = cytometryDiscrete$ivn) +#' "cytometryDiscrete" + +#' The continuous cytometry network +#' +#' Data and network for analyzing the flow cytometry experiment +#' from \href{http://science.sciencemag.org/content/308/5721/523.long}{Sachs et al. (2005)}. +#' This dataset contains the raw measurements from these experiments. +#' +#' @format A \code{\link{list}} with three components: +#' +#' \itemize{ +#' \item \code{dag} An \code{\link[sparsebnUtils]{edgeList}} containing the consensus network (11 nodes, 17 edges). +#' \item \code{data} A \code{\link{data.frame}} with 11 variables and 7466 observations. +#' \item \code{ivn} A \code{\link{list}} specifying which nodes are under intervention in each observation. +#' Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. +#' } +#' +#' @usage +#' data(cytometryContinuous) +#' +#' @examples +#' # Create a valid sparsebnData object from the cytometry data +#' data(cytometryContinuous) +#' dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) +#' +"cytometryContinuous" diff --git a/R/sparsebn-main.R b/R/sparsebn-main.R index b8f81fd..8260245 100644 --- a/R/sparsebn-main.R +++ b/R/sparsebn-main.R @@ -11,6 +11,8 @@ # # CONTENTS: # estimate.dag +# estimate.covariance +# estimate.precision # #' Estimate a DAG from data @@ -42,6 +44,15 @@ #' #' @return A \code{\link[sparsebnUtils]{sparsebnPath}} object. #' +#' @examples +#' +#' \dontrun{ +#' # Estimate a DAG from the cytometry data +#' data(cytometryContinuous) +#' dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) +#' estimate.dag(dat) +#' } +#' #' @export estimate.dag <- function(data, lambdas = NULL, @@ -117,6 +128,15 @@ estimate.dag <- function(data, #' corresponding to an estimate of the covariance or precision (inverse covariance) matrix for a #' given value of lambda. 
#' +#' @examples +#' +#' \dontrun{ +#' data(cytometryContinuous) +#' dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) +#' estimate.covariance(dat) # estimate covariance +#' estimate.precision(dat) # estimate precision +#' } +#' #' @name estimate.covariance #' @rdname estimate.covariance NULL diff --git a/R/zzz.R b/R/zzz.R index 718cb58..1d15f31 100644 --- a/R/zzz.R +++ b/R/zzz.R @@ -10,9 +10,10 @@ .onAttach <- function(libname, pkgname){ msg <- paste0("\n", - "sparsebn v0.1, Copyright (c) 2016\n", + "sparsebn v0.0.1, Copyright (c) 2016\n", "\tBryon Aragam, University of California, Los Angeles\n", "\tJiaying Gu, University of California, Los Angeles\n", + "\tDacheng Zhang, University of California, Los Angeles\n", "\tQing Zhou, University of California, Los Angeles\n", "\tFei Fu\n", "\n", diff --git a/README.Rmd b/README.Rmd index c6034ab..2bf3b9a 100644 --- a/README.Rmd +++ b/README.Rmd @@ -16,8 +16,7 @@ knitr::opts_chunk$set( [![Travis-CI Build Status](https://travis-ci.org/itsrainingdata/sparsebn.svg?branch=master)](https://travis-ci.org/itsrainingdata/sparsebn) -Methods for learning sparse Bayesian networks and other graphical models from -high-dimensional data via sparse regularization. Designed to handle: +Methods for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed to handle: - Experimental data with interventions - Mixed observational / experimental data diff --git a/cran-comments.md b/cran-comments.md new file mode 100644 index 0000000..850bb13 --- /dev/null +++ b/cran-comments.md @@ -0,0 +1,33 @@ +## Test environments +* local OS X install, R 3.3.1 +* ubuntu 12.04 (travis-ci), R 3.3.1 +* win-builder (devel and release) + +## R CMD check results +There were no ERRORs or WARNINGs. + +There was 1 NOTE: + +* checking CRAN incoming feasibility ... NOTE +Maintainer: ‘Bryon Aragam <sparsebn@gmail.com>’ + +New submission + +This is the first version of this package that has been submitted to CRAN. + +## Re-submission notes +- I get + +Reading CITATION file fails with + $ operator is invalid for atomic vectors +when package is not installed. + +Fixed by replacing packageDescription with meta as described in Section 1.9 of R-exts. + +## Dependencies + +CHECK has been run on all dependencies and passed. + +## Reverse dependencies + +This is a new release, so there are no reverse dependencies. 
diff --git a/data/cytometryContinuous.rda b/data/cytometryContinuous.rda new file mode 100644 index 0000000..78527fa Binary files /dev/null and b/data/cytometryContinuous.rda differ diff --git a/data/cytometryDiscrete.rda b/data/cytometryDiscrete.rda index 7505c9f..9fd3593 100644 Binary files a/data/cytometryDiscrete.rda and b/data/cytometryDiscrete.rda differ diff --git a/data/pathfinder.rda b/data/pathfinder.rda index 3c22b5c..51cd7e2 100644 Binary files a/data/pathfinder.rda and b/data/pathfinder.rda differ diff --git a/inst/CITATION b/inst/CITATION new file mode 100644 index 0000000..7108de0 --- /dev/null +++ b/inst/CITATION @@ -0,0 +1,18 @@ +citHeader("To cite sparsebn in publications use:") + +year <- sub(".*(2[[:digit:]]{3})-.*", "\\1", meta$Date, perl = TRUE) +vers <- paste("R package version", meta$Version) + +citEntry(entry = "Manual", + title = "sparsebn: Learning Sparse Bayesian Networks from High-Dimensional Data", + author = personList(as.person("Bryon Aragam"), as.person("Jiaying Gu"), as.person("Dacheng Zhang"), as.person("Fei Fu"), as.person("Qing Zhou")), + year = year, + note = vers, + url = "https://github.com/itsrainingdata/sparsebn", + textVersion = paste0("Bryon Aragam, Jiaying Gu, Dacheng Zhang, Fei Fu, and Qing Zhou (", year, "). ", + "sparsebn: Learning Sparse Bayesian Networks from High-Dimensional Data. ", + vers, ".\n", + "https://github.com/itsrainingdata/sparsebn") +) + +citFooter("A lot of effort has gone into building and maintaining the sparsebn package. Please cite it in all papers where it is used.") diff --git a/man/cytometryContinuous.Rd b/man/cytometryContinuous.Rd new file mode 100644 index 0000000..8efc2ee --- /dev/null +++ b/man/cytometryContinuous.Rd @@ -0,0 +1,30 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/data.R +\docType{data} +\name{cytometryContinuous} +\alias{cytometryContinuous} +\title{The continuous cytometry network} +\format{A \code{\link{list}} with three components: + +\itemize{ +\item \code{dag} An \code{\link[sparsebnUtils]{edgeList}} containing the consensus network (11 nodes, 17 edges). +\item \code{data} A \code{\link{data.frame}} with 11 variables and 7466 observations. +\item \code{ivn} A \code{\link{list}} specifying which nodes are under intervention in each observation. +Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. +}} +\usage{ +data(cytometryContinuous) +} +\description{ +Data and network for analyzing the flow cytometry experiment +from \href{http://science.sciencemag.org/content/308/5721/523.long}{Sachs et al. (2005)}. +This dataset contains the raw measurements from these experiments. +} +\examples{ +# Create a valid sparsebnData object from the cytometry data +data(cytometryContinuous) +dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) + +} +\keyword{datasets} + diff --git a/man/cytometryDiscrete.Rd b/man/cytometryDiscrete.Rd index f9cbda2..911b20c 100644 --- a/man/cytometryDiscrete.Rd +++ b/man/cytometryDiscrete.Rd @@ -3,7 +3,7 @@ \docType{data} \name{cytometryDiscrete} \alias{cytometryDiscrete} -\title{The cytometry network} +\title{The discrete cytometry network} \format{A \code{\link{list}} with three components: \itemize{ @@ -13,12 +13,18 @@ Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. 
}} \usage{ -cytometryDiscrete +data(cytometryDiscrete) } \description{ Data and network for analyzing the flow cytometry experiment from \href{http://science.sciencemag.org/content/308/5721/523.long}{Sachs et al. (2005)}. -The data is a cleaned and discretized version of the raw data from these experiments. +The data is a cleaned and discretized version of the raw data (see \code{\link{cytometryContinuous}}) from these experiments. +} +\examples{ +# Create a valid sparsebnData object from the cytometry data +data(cytometryDiscrete) +dat <- sparsebnData(cytometryDiscrete$data, type = "d", ivn = cytometryDiscrete$ivn) + } \keyword{datasets} diff --git a/man/estimate.covariance.Rd b/man/estimate.covariance.Rd index a7b313b..272ccab 100644 --- a/man/estimate.covariance.Rd +++ b/man/estimate.covariance.Rd @@ -28,4 +28,14 @@ For Gaussian data, the precision matrix corresponds to an undirected graphical m distribution. This undirected graph can be tied to the corresponding directed graphical model; see Sections 2.1 and 2.2 (equation (6)) of Aragam and Zhou (2015) for more details. } +\examples{ + +\dontrun{ +data(cytometryContinuous) +dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) +estimate.covariance(dat) # estimate covariance +estimate.precision(dat) # estimate precision +} + +} diff --git a/man/estimate.dag.Rd b/man/estimate.dag.Rd index 55c9a3c..2889afc 100644 --- a/man/estimate.dag.Rd +++ b/man/estimate.dag.Rd @@ -49,4 +49,14 @@ combination of discrete / continuous and observational / experimental data. For details on the underlying methods, see \code{\link[ccdrAlgorithm]{ccdr.run}} and \code{\link[discretecdAlgorithm]{cd.run}}. } +\examples{ + +\dontrun{ +# Estimate a DAG from the cytometry data +data(cytometryContinuous) +dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn) +estimate.dag(dat) +} + +} diff --git a/man/pathfinder.Rd b/man/pathfinder.Rd index ad6db1a..da6be49 100644 --- a/man/pathfinder.Rd +++ b/man/pathfinder.Rd @@ -4,14 +4,42 @@ \name{pathfinder} \alias{pathfinder} \title{The pathfinder network} -\format{An \code{\link{edgeList}} object with 109 nodes and 195 edges.} +\format{A \code{\link{list}} with four components: + +\itemize{ +\item \code{dag} An \code{\link[sparsebnUtils]{edgeList}} containing the pathfinder network (109 nodes, 195 edges). +\item \code{data} A \code{\link{data.frame}} with 109 variables and 1000 observations. +\item \code{ivn} A \code{\link{list}} specifying which nodes are under intervention in each observation; since this dataset +is purely observational, this is just \code{NULL}. Compatible with the input to \code{\link[sparsebnUtils]{sparsebnData}}. +\item \code{cov} Covariance matrix used to generate the data. +}} \usage{ -pathfinder +data(pathfinder) } \description{ -The pathfinder network from the \href{http://www.bnlearn.com/bnrepository/#pathfinder}{Bayesian network repository}. -Pathfinder is an expert system developed in 1992 by Heckerman et. al +Simulated data and network for the pathfinder network from the \href{http://www.bnlearn.com/bnrepository/#pathfinder}{Bayesian network repository}. +Pathfinder is an expert system developed by Heckerman et al. (1992) to assist with the diagnosis of lymph-node diseases. +} +\details{ +The data is simulated from a Gaussian SEM assuming unit edge weights and +unit variances for all nodes. 
+} +\examples{ +\dontrun{ +# Create a valid sparsebnData object from the simulated pathfinder data +data(pathfinder) +dat <- sparsebnData(pathfinder$data, type = "c") + +# If desired, change the edge weights to be randomly generated +coefs <- get.adjacency.matrix(pathfinder$dag) +coefs[coefs != 0] <- runif(n = num.edges(pathfinder$dag), min = 0.5, max = 2) +vars <- Matrix::Diagonal(n = num.nodes(pathfinder$dag), x = rep(1, num.nodes(pathfinder$dag))) +id <- vars +covMat <- t(solve(id - coefs)) \%*\% vars \%*\% solve(id - coefs) +pathfinder.data <- mvtnorm::rmvnorm(n = 1000, sigma = as.matrix(covMat)) +} + } \keyword{datasets} diff --git a/tests/testthat/test-dag.R b/tests/testthat/test-dag.R index 8bbf0ac..3c9c6d9 100644 --- a/tests/testthat/test-dag.R +++ b/tests/testthat/test-dag.R @@ -4,5 +4,17 @@ dat_cts <- generate_continuous_sparsebnData() dat_disc <- generate_discrete_sparsebnData() test_that("", { - ### Tests added here + ### More tests added here +}) + +test_that("DAG estimation runs without errors", { + expect_error(estimate.dag(dat_cts), NA) # continuous data + expect_error(estimate.dag(dat_disc), NA) # discrete data +}) + +test_that("estimate.dag returns expected output", { + out <- estimate.dag(dat_cts) + expect_is(out, "sparsebnPath") + expect_true(check_list_class(as.list(out), "sparsebnFit")) + expect_true(is.zero(out[[1]]$edges)) # first estimate should always be null }) diff --git a/vignettes/sparsebn-vignette.Rmd b/vignettes/sparsebn-vignette.Rmd index 3c19559..803a820 100644 --- a/vignettes/sparsebn-vignette.Rmd +++ b/vignettes/sparsebn-vignette.Rmd @@ -9,7 +9,26 @@ vignette: > %\VignetteEncoding{UTF-8} --- -Introduction goes here +`sparsebn` is an `R` package for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed to handle: + +- Experimental data with interventions +- Mixed observational / experimental data +- High-dimensional data with _p >> n_ +- Datasets with thousands of variables (tested up to _p_=8000) +- Continuous and discrete data + +The workhorse behind [`sparsebn`](http://www.github.com/itsrainingdata/sparsebn/) is the [`sparsebnUtils`](http://www.github.com/itsrainingdata/sparsebnUtils/) +package, which provides various S3 classes and methods for representing and manipulating graphs. The basic algorithms are implemented in [`ccdrAlgorithm`](http://www.github.com/itsrainingdata/ccdrAlgorithm/) and [`discretecdAlgorithm`](http://www.github.com/gujyjean/discretecdAlgorithm/). + +## Overview + +The main methods for learning graphical models are: + +* `estimate.dag` for directed acyclic graphs (Bayesian networks). +* `estimate.precision` for undirected graphs (Markov random fields). +* `estimate.covariance` for covariance matrices. + +Currently, estimation of precision and covariance matrices is limited to Gaussian data. ## Example: Learning a Markov Chain @@ -17,7 +36,7 @@ Suppose that the data is generated by a simple Markov chain: $$X_1\to X_2\to X_3.$$ -Assume unit influences between variables, i.e. $X_j\sim\mathcal{N}(0,1)$ and $X_{j} = X_{j-1} + \varepsilon$ with $\varepsilon\sim\mathcal{N}(0,1)$ for $j>1$. +Assume unit influences between variables, i.e. $X_1\sim\mathcal{N}(0,1)$ and $X_{j} = X_{j-1} + \varepsilon_j$ with $\varepsilon_j\sim\mathcal{N}(0,1)$ for $j>1$. 
If $X=(X_1,X_2,X_3)$ then $X=B^TX+\varepsilon\sim\mathcal{N}(0,\Sigma)$, where we use the following parameters: $$ B = \begin{pmatrix} @@ -48,6 +67,7 @@ covariance.matrix <- rbind(c(3,2,1), Then we can generate some data using `mvtnorm::rmvnorm`: ```{r} gaussian.data <- mvtnorm::rmvnorm(n = 100, mean = mean.vector, sigma = covariance.matrix) +colnames(gaussian.data) <- c("X1", "X2", "X3") ``` In order to use the methods in the sparsebn package, we need to indicate what kind of data we are working with by wrapping the data into a `sparsebnData` object: @@ -73,42 +93,55 @@ dags.out[[3]] get.adjacency.matrix(dags.out[[3]]) ``` -## Example: Bayesian network repository - -For a less trivial example, we will try to reconstruct the pathfinder network from the [Bayesian network repository](http://www.bnlearn.com/bnrepository/#pathfinder). The pathfinder network has 109 nodes and 195 edges. +We can also use this data to directly estimate the covariance matrix $\Sigma$: +```{r, warning=FALSE, message=FALSE} +cov.out <- estimate.covariance(data = dat) +``` -First, load the data and construct a covariance matrix: +Compared with the output of `estimate.dag`, which is a more complicated `sparsebnPath` object, the output of `estimate.covariance` (and also `estimate.precision`) is simply a list of matrices: +```{r} +class(cov.out) +``` +Let's take a look at the third estimate in the solution path (corresponding to the correct estimate of $B$ from before): ```{r} -data(pathfinder) -A <- as.matrix(get.adjacency.matrix(pathfinder$dag)) # pathfinder network as an adjacency matrix -id <- diag(rep(1, num.nodes(pathfinder$dag))) # 109x109 identity matrix -Sigma <- solve(t(id - A)) %*% id %*% solve(id - A) +cov.out[[3]] ``` -For simplicity we set all of the edge weights and variances to 1. Next, generate some random data: +If we increase our sample size to $n=1000$, the estimate gets closer to the truth: +```{r, warning=FALSE, message=FALSE} +gaussian.data <- mvtnorm::rmvnorm(n = 1000, mean = mean.vector, sigma = covariance.matrix) +dat <- sparsebnData(gaussian.data, type = "continuous") +cov.out <- estimate.covariance(data = dat) +cov.out[[3]] +``` + +## Example: Bayesian network repository + +For a less trivial example, we will try to reconstruct the pathfinder network from the [Bayesian network repository](http://www.bnlearn.com/bnrepository/#pathfinder). The pathfinder network has 109 nodes and 195 edges. + +First, load the data: ```{r} -set.seed(123) -nn <- 1000 -gdata <- mvtnorm::rmvnorm(nn, sigma = Sigma) +data(pathfinder) +dat <- sparsebnData(pathfinder$data, type = "c", ivn = NULL) ``` -Now we create a `sparsebnData` object (note that the data here has no interventions), and generate a large grid of regularization parameters for the solution path (see `?generate.lambdas`). +Note that we create a `sparsebnData` object with no interventions by setting `ivn = NULL`. This time, instead of automatically generating a grid of regularization parameters, we generate one manually (see `?generate.lambdas`). 
Then we run the algorithm: ```{r message=FALSE, warning=FALSE} -dat <- sparsebnData(gdata, type = "c", ivn = NULL) +nn <- num.samples(dat) # number of samples in the dataset / equivalent to nrow(dat$data) lambdas <- generate.lambdas(sqrt(nn), 0.05, lambdas.length = 50, scale = "linear") -out <- estimate.dag(data = dat, - lambdas = lambdas, - edge.threshold = 500, - verbose = FALSE) -out +dags.out <- estimate.dag(data = dat, + lambdas = lambdas, + edge.threshold = 500, + verbose = FALSE) +dags.out ``` Let's visualize the solution with 195 edges. For this, we use the `igraph` package: ```{r} -solution <- get.solution(out, edges=195) +solution <- get.solution(dags.out, edges = 195) plot(solution, layout = igraph::layout.circle(to_igraph(solution$edges)), vertex.label = NA, @@ -131,12 +164,76 @@ plot(pathfinder$dag, edge.arrow.size = .25) ``` +We can also use the parameter selection technique described in \[[4](#references)\] to automatically select a good index (see `?select.parameter` for more details): +```{r} +select.idx <- select.parameter(dags.out, dat) +solution <- get.solution(dags.out, index = select.idx) +plot(solution, + layout = igraph::layout.circle(to_igraph(solution$edges)), + vertex.label = NA, + vertex.size = 5, + vertex.color = gray(0.75), + edge.color = gray(0), + edge.width = 1, + edge.arrow.size = .25) +``` + + Note that the output of `estimate.dag` is a list of _graphs_, i.e. without weights. In order to do inference and estimate the weights in these graphs, use `estimate.parameters`: ```{r} -out.fit <- estimate.parameters(out, data = dat) +out.fit <- estimate.parameters(dags.out, data = dat) ``` The output is a list of weights, one for each value of $\lambda$. The weights are given in terms of $(B,\Omega)$, corresponding to the list `list(coefs, vars)`. For example, we can see how the weight of the first node on the second changes as we decrease $\lambda$: ```{r} unlist(lapply(out.fit, function(x) x$coefs[1,2])) ``` + +## Appendix: Simulating data from a linear Gaussian SEM + +The `pathfinder` dataset has been simulated from a linear Gaussian SEM. In this section, we illustrate how to do this from scratch. + +We first need a DAG; for this example we will use the pathfinder network from the previous section. First, load this DAG: + +```{r} +data(pathfinder) +B <- as.matrix(get.adjacency.matrix(pathfinder$dag)) # pathfinder network as an adjacency matrix +``` + +If $X=B^TX+\varepsilon\sim\mathcal{N}(0,\Sigma)$ and $\varepsilon\sim\mathcal{N}(0,\Omega)$, then one can show that: +$$ +\Sigma = (I-B)^{-T}\Omega(I-B)^{-1}. +$$ + +Assuming unit influences (i.e. $\beta_{ij}=1$ if $\beta_{ij}\ne 0$) and unit variances (i.e. $\omega_j^2=1$ for all $j$), we can then compute $\Sigma$ directly: + +```{r} +id <- diag(rep(1, num.nodes(pathfinder$dag))) # 109x109 identity matrix +Omega <- id # conditional variances +Sigma <- solve(t(id - B)) %*% Omega %*% solve(id - B) +``` + +Finally, we can use the `mvtnorm` package to generate random multivariate Gaussian data: + +```{r} +set.seed(123) +nn <- 1000 +gaussian.data <- mvtnorm::rmvnorm(nn, sigma = Sigma) +``` + +Instead of setting $\beta_{ij}=1$, we can also use random edge weights: +```{r} +B[B!=0] <- runif(n = num.edges(pathfinder$dag), min = 0.5, max = 2) +Sigma <- solve(t(id - B)) %*% Omega %*% solve(id - B) +gaussian.data <- mvtnorm::rmvnorm(nn, sigma = Sigma) +``` + +## References + +\[1\] Aragam, B. and Zhou, Q. (2015). 
[Concave penalized estimation of sparse Gaussian Bayesian networks.](http://jmlr.org/papers/v16/aragam15a.html) _The Journal of Machine Learning Research_. 16(Nov):2273−2328. + +\[2\] Fu, F., Gu, J., and Zhou, Q. (2014). [Adaptive penalized estimation of directed acyclic graphs from categorical data.](http://arxiv.org/abs/1403.2310) arXiv: 1403.2310. + +\[3\] Aragam, B., Amini, A. A., and Zhou, Q. (2015). [Learning directed acyclic graphs with penalized neighbourhood regression.](http://arxiv.org/abs/1511.08963) arXiv: 1511.08963. + +\[4\] Fu, F. and Zhou, Q. (2013). [Learning sparse causal Gaussian networks with experimental intervention: Regularization and coordinate descent.](http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2012.754359) _Journal of the American Statistical Association_, 108: 288-300.
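+
+## Appendix (continued): From simulated data back to estimation
+
+As a minimal sketch, the random-weight `gaussian.data` generated in the appendix can be fed straight back into `estimate.dag`. The lambda grid and `edge.threshold` below simply mirror the illustrative choices used earlier in this vignette; they are not package defaults:
+
+```{r, eval=FALSE}
+# Wrap the simulated (purely observational) data and re-estimate the DAG
+dat.sim <- sparsebnData(gaussian.data, type = "c", ivn = NULL)
+lambdas <- generate.lambdas(sqrt(nrow(gaussian.data)), 0.05,
+                            lambdas.length = 50, scale = "linear")
+dags.sim <- estimate.dag(data = dat.sim,
+                         lambdas = lambdas,
+                         edge.threshold = 500,
+                         verbose = FALSE)
+
+# Pull out the estimate whose size matches the true network (195 edges)
+get.solution(dags.sim, edges = 195)
+```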