This is an R
package to conduct the Stochastic Approximation Cut Algorithm. This repository also contains code that replicates the results in the paper.
Yang Liu and Robert Goudie
MRC Biostatistics Unit, University of Cambridge
Modern Bayesian modelling enables us to accommodate complex forms of data and make comprehensive inference. This facilitates the combination of highly structured models, but the potential effect of misspecification of the model is a concern: by integrating everything into a single model, the posterior distribution of the whole model may be affected even if only one part is not correctly specified. When the model is partially misspecified, recent studies have proposed the idea of modularized models in which the whole model is divided into distinct modules that are linked by Bayes' theorem (e.g. Lunn et al. (2009); Liu et al. (2009); Plummer (2015)). To prevent non-suspect modules being affected by the suspect modules, the 'cut model' has been proposed to prevent feedback of information from the suspect module. This leads to the 'cut distribution', which replaces the standard posterior. This distribution normally does not have a closed form. Previous studies have proposed algorithms to sample from this distribution, but these algorithms have unclear theoretical convergence properties.
To address this convergence problem, the Stochastic Approximation Cut algorithm (SACut) is proposed to draw samples from the cut distribution. The algorithm is divided into two chains that are run in parallel. The main chain targets an approximation of the cut distribution; the auxiliary chain forms a proposal distribution used in the main chain. The samples drawn by the proposed algorithm satisfy a weak law of large numbers. Although SACut is biased, since the main chain does not target the exact cut distribution, this bias can be reduced geometrically by increasing a user-chosen tuning parameter. In addition, parallel processing can be easily adopted for SACut, unlike existing algorithms. This greatly reduces computation time.
Please cite the following paper when using SACut:
Liu, Yang and Robert J.B. Goudie. “Stochastic Approximation Cut Algorithm for Inference in Modularized Bayesian Models.” Statistics and Computing 32, 7 (2022). Link
Simply download and install the SACut package from Github.
devtools::install_github('MathBilibili/Stochastic-approximation-cut-algorithm')
library('SACut')
SACut depends on the packages modeltools
, flexclust
, compiler
, doParallel
, dplyr
, tidyr
, progress
and data.table
which can be installed via:
install.packages(c("modeltools", "flexclust", "compiler", "doParallel", "dplyr", "tidyr","progress", "data.table"))
Additionally you may need truncnorm
and mvtnorm
to write likelihood or proposal distribution of truncated normal distribution and multivariate normal distribution, and coda
to assess the convergence of the Markov chain.
install.packages(c("truncnorm", "mvtnorm", "coda"))
Here we explore the usage of SACut on the HPV study discussed in Plummer (2015). First, we load the data and the dimension of parameter theta
and phi
Z <- c(7, 6, 10, 10, 1, 1, 10, 4, 35, 0, 10, 8, 4)
Npart <- c(111, 71, 162, 188, 145, 215, 166, 37, 173,143, 229, 696, 93)
Y <- c(16, 215, 362, 97, 76, 62, 710, 56, 133,28, 62, 413, 194)
Npop <- c(26983, 250930, 829348, 157775, 150467, 352445, 553066, 26751, 75815, 150302, 354993, 3683043, 507218)
d_x<-2 # dimension of theta
d_y<-13 # dimension of phi
where Z
is the number of people with HPV infection out of a sample of NPart
, and Y
is the number of cancer cases from T=0.001*Npop
person-years at 13 cities. Two modules are difined as:
SACut is applied to prevent the feedback from second module to the estimation of parameter phi
. At this step, you should define the joint distribution of phi
and Z
(first module) according to your task. For this particular demonstration, the log-density of the joint distribution (likelihood and prior) for first module is loaded:
px<-function(phi,Z){
PbZ<-rbind(phi,Z,Npart)
out<-sum(apply(PbZ,FUN=function(y){dbeta(x=y[1],shape1 = (1+y[2]), shape2 = (1+y[3]-y[2]), log = T)},MARGIN = 2))
return(out)
}
You should also define the joint distribution of Y
, theta
and phi
(second module) according to your task. For this particular demonstration, the log-density of the joint distribution (likelihood and prior) for the second module is loaded:
py<-function(Y,theta,phi){
PbY<-rbind(phi,Y,Npop)
out<- log(1/200)+sum(apply(PbY,FUN=function(y){dpois(x=y[2],lambda = (y[3]*0.001*exp(theta[1]+theta[2]*y[1])),log = T)},MARGIN = 2))
return(out)
}
We use the truncated normal distribution as the proposal distribution for parameter phi
and the log-density and random generation are:
prox<-function(phi_n,phi){
out<-sum(log(dtruncnorm(phi_n, a=0, b=1, mean = phi, sd = 0.005)))
return(out)
}
rprox<-function(phi){
out<-rtruncnorm(1, a=0, b=1, mean = phi, sd = 0.005)
return(out)
}
In the auxiliary chain, we use the multivariate normal distribution as the proposal distribution for auxiliary parameter theta
(note that, this is not the parameter theta
in the main chain), the density (not log) and random generation are:
proy<-function(theta_n,theta){
re<-dmvnorm(as.numeric(theta_n),mean = as.numeric(theta),sigma = matrix(c(0.1648181,-0.3979341,-0.3979341,2.737874),ncol=2,nrow=2)/10)
return(re)
}
rproy<-function(theta){
re<-rmvnorm(1,mean = theta,sigma = matrix(c(0.1648181,-0.3979341,-0.3979341,2.737874),ncol=2,nrow=2)/10) %>% t()
return(re)
}
Now we have defined every components of the cut distribution, then call function CutModel
to build the cut model:
cutmodel <- CutModel(px = px, py = py, prox = prox, rprox = rprox, proy = proy, rproy = rproy,
Z = Z, Y = Y, d_x = d_x, d_y = d_y)
Here we do not import px
and py
from other package. In the case that they are written and built within other R package for high computational speed:
devtools::install_github('MathBilibili/Stochastic-approximation-cut-algorithm/Examples/Example3/Plummer2015Example/')
library(Plummer2015Example)
cutmodel <- CutModel(px = px, py = py, prox = prox, rprox = rprox, proy = proy, rproy = rproy,
Z = Z, Y = Y, d_x = d_x, d_y = d_y, cpp_yes = TRUE, cpp_package = 'Plummer2015Example')
After setting the cut model, we set the parallel environment for the computation by function ComEnvir
. For a Windows device with 4 core, a cluster of PSOCK
is created. We also need claim the list of variables that are exported to clusters (no need for Linux).
comenvir <- ComEnvir(is_Unix = FALSE, core_num = 4, clusterExport = list('py','Y','Npop','dmvnorm'))
We then load the existing auxiliary parameter set phi0
from a .CSV
file by function LoadOldPhi0
, the file (PhiC.csv
) can be download from https://github.com/MathBilibili/Stochastic-approximation-cut-algorithm/tree/master/Examples/Example3. Note that, this parameter set can be created by calling function BuildNewPhi0
.
PhiC <- LoadOldPhi0(filename="PhiC.csv")
As suggested by Liang et al. (2016), the auxiliary chain is conducted solely with sufficient iterations so that it is converged when the main Markov chain (external chain) start to run. Hence, we conduct a preliminary run with 1501000 iterations by calling function Preliminary_SACut
. The shrink magnitude no
is set to be large enough to ensure that the auxiliary chain can completely go through every phi0
before it converges, acce_pa
is set to be 10 to speed up the convergence. The precision parameter is set to be 3 for theta_1
and 2 for theta_2
.
init<-list(theta=c(-2,13),phi=apply(PhiC,MARGIN = 2,median),t=as.matrix(c(-2,13)),I=1)
PreRun <- Preliminary_SACut(init=init, PhiC,numrun=1501000,auxrun=1500000,no=20000,acce_pa=10,
sig_dig=c(3,2), CutModel=cutmodel)
If you simply want to try the code without waiting, you can run following code instead
PreRun <- Preliminary_SACut(init=init, PhiC,numrun=250,auxrun=200,no=10,acce_pa=1,
sig_dig=c(3,2), CutModel=cutmodel)
Finally, we are able to run the auxiliary chain and the main chain in parallel by calling the function SACut
. The total number of iterations is 140000 and we retain only every 100 sample after discarding the first 40000 samples. The result is stored in file Result.csv
(updated every 1000 iterations).
SACut(pre_values=PreRun, PhiC=PhiC,numrun=140000,burnin=40000,thin=100, no=20000,acce_pa=10, sig_dig=c(3,2),
filename='Result.csv',storage_step=1000, Comenvir=comenvir, CutModel=cutmodel)
If you simply want to try the code without waiting, you can run following code instead
SACut(pre_values=PreRun, PhiC=PhiC,numrun=100,burnin=10,thin=1, no=10,acce_pa=1, sig_dig=c(3,2),
filename='Result.csv', print_theta=TRUE, Comenvir=comenvir, CutModel=cutmodel)