- Computation and Bayesian approach
- full probability modeling
- a.k.a. the Bayesian approach
- applying probability theory to a model derived from substantive knowledge
- able to deal w/ realistically complex situations
- computational difficulties:
w/ the specific problem being to carry out the integrations necessary to obtain the posterior distributions of quantities of interest in situations- non-standard prior distributions used
- additional nuisance parameters in the model
- solution: Markov Chain Monte Carlo (MCMC) methods
- full probability modeling
-
- issue: difficult to completely specify the prior density based on available expert opinion or background scientific information
- utilizing hyperparametes to resolve
-
${p_1(\theta|\lambda): \lambda \in \Lambda}$ : a class of prior densities -
$\lambda$ : hyper-parameter -
$p_1(\cdot | \cdot)$ : a conditional density of$\theta$ given$\lambda$
-
- methods to determine hyperparameter
$\lambda \to$ empirical estimate$\hat{\lambda}$ - Bayesian Hierarchical Model (BHM)
- Empirical Bayes (EB) method
- using
$p(\theta) = p_1(\theta | \hat{\lambda})$ as the prior for$\theta$ - criticizing: the prior density
$p(\theta)$ obtained via the data$Y$
-
- computing posterior summary estimates
- binomial sampling density:
- Zellner's prior:
$p(\theta) = C\theta^\theta (1-\theta)^{1-\theta}$ for$\theta \in (0, 1)$ and C = 1.61857 - Jeffery's prior:
$Beta(a = 0.5, b = 0.5)$
- Zellner's prior:
- Haldane's prior: improper prior
$p(\theta) = [\theta(1-\theta)]^{-1}$ - uniform prior: same as
$Beta(a=1, b=1)$ w/$\theta$ using binomial sampling density - posterior inference about
$\theta$ or odds ratio$\rho = \theta/(1-\theta)$ or log-odds ratio$\eta = \log \rho$ relative insensitive to all previous priors
-
Simulation methods for posterior distribution
- general procedure
- drawing
$\theta^1, \dots, \theta^B \sim p(\theta ,|, \mathcal{D}_n)$ - generating a histogram of
$\theta^1, \dots, \theta^B$ approximates the posterior density$p(\theta ,|, \mathcal{D}_n)$
- drawing
- approximation to the posterior mean: $\overline{\theta}_n = \mathbb{E}(\theta ,|, \mathcal{D}n) = B^{-1}\sum{j=1}^B \theta^j$
- approximation to posterior
$(1-\alpha)$ interval:$(\theta_{\alpha/2}, \theta_{1 - \alpha/2})$ w/$\theta_{\alpha/2}$ as the$\alpha/2$ sample quantile of$\theta^1, \dots, \theta^B$ -
$\exists; \theta^1, \dots, \theta^B$ from$p(\theta | \mathcal{D}_n)$ and$\tau^i = g(\theta^i) ,\forall, i=1, \dots, B \implies \tau^1, \dots, \tau^B$ as a sample from$p(\tau ,|, \mathcal{D}_n)$
- general procedure
-
Methods for obtaining simulated values from the posterior
- stochastic simulation methods
- based on random sampling
- typical approaches
- Monte Carlo integration
- importance sampling
- Markov chain Monte Carlo (MCMC)
- variational inference
- based on deterministic approximation and numerical optimization
- applied to wide range problems
- very weak theoretical guarantees
- stochastic simulation methods
-
Simulation approach for distribution
- drawing sample
$X$ from a distribution$F$ -
$F(X)$ : uniform distribution over the interval (0, 1) - a basic strategy to sample
$U \sim \text{Uniform}(0, 1) \to X = F^{-1}(U)$
-
- drawing sample
-
Integrals for posterior inference
-
issue: almost any posterior inference based on
$K(\theta; Y) \to$ high-dimensional integration w/$\Theta \in \mathbb{R}^m$ , where$\mathbb{R}^m$ as a Euclidean space -
ageneric function
$\mathfrak{g}(\theta)$ of the parameter$\theta$ , the posterior inference requiring the computation of the integral[ I = I(Y) \int \mathfrak{g}(\theta) K(\theta; Y) d\theta \tag{11} ]
-
-
Numerical methods for integrals
- numerical approaches
- classical (deterministic) numerical methods
- applied when
$m < 5$ - order of accuracy:
$O(N^{-2/m})$ - rate of convergence depending on
$m$ (curse of dimensionality)
- applied when
- stochastic integration methods
- applied when
$m \geq 5$ - known as Monte Carlo (MC) methods
- order of accuracy:
$O(^{-1/2})$ - rate of convergence not depending on
$m$
- applied when
- classical (deterministic) numerical methods
- performance: deterministic >> stochastic
- numerical approaches
-
Basic simulation approach to estimate integral
-
estimating the integral
$\int_0^1 h(x) ,dx$ for some complicated function$h$ -
drawing
$N$ samples$X_i \sim \text{Uniform}(01, 1)$ -
estimating the integral
[ \int_0^1 h(x) ,dx \approx \frac{1}{N} \sum_{i=1}^N h(X_i) ]
-
converged to the desired integral by the law of large numbers
-
-
Bayesian hierarchical Model (EBM)
-
assigning another prior distribution for
$\lambda$ :$\lambda \sim p_2(\lambda)$ -
$p_2(\cdot)$ : a probability density defined over$\Lambda$
-
$\therefore; p(\theta) = \int p_1(\theta|\lambda) p_2(\lambda)d\lambda$ : a more flexible prior for$\theta$ -
forming a hierarchical stages for the complete Bayes model
[ Y|\theta, \lambda \sim p(Y|\theta) \implies \theta|\lambda \sim p_1(\theta|\lambda) \quad &\quad \lambda \sim p_2(\lambda)]
-
often choosing
$p_2(\cdot)$ suitable to obtain estimators w/ good frequentist properties -
issue: inference based on
$p(\theta|Y)$ requiring integration both on$\theta$ and$\lambda$ -
solutions: Monte Carlo methods (in particular, Markov Chain Monte Carlo (MCMC) methods) used to obtain samples from the posterior density
$p(\theta|Y)$
-
-
-
based on marginal likelihood
$m(Y|\lambda) = \int p(Y|\theta) p_1(\theta|\lambda) d\theta$ -
$m(Y|\lambda)$ : the marginal likelihood function of$\lambda \implies$ maximizing[ \hat{\lambda} = \hat{\lambda}(Y) = \underset{\lambda \in \Lambda}{\mathrm{argmin}}; m(Y|\lambda) ]
-
alternatively, obtaining a moment based method to 'estimate'
$\lambda$ -
$\lambda$ : a$q$ -dimensional parameter - using a set of
$q$ suitable moments of$Y$ w.r.t. the marginal density$m(Y|\lambda)$ - equate moments to the corresponding
$q$ empirical moments of the data vector$Y = (y_1, y_2, \dots, y_n)$
-
-
-
Simulation methods for Bayesian inference
-
the posterior density
[ \pi(\theta ,|, \mathcal{D}_n) = \frac{\mathcal{L}_n(\theta)\pi(\theta)}{c} ]
-
$\mathcal{L}_n(\theta)$ : the likelihood function - the normalizing constant
$c = \int \mathcal{L}_n(\theta) \pi(\theta) ,d\theta$
-
-
the posterior mean
[ \overline{\theta} = \int \theta \pi(\theta ,|, \mathcal{D}_n) ,d\theta = \frac{1}{c} \int \theta\mathcal{L}_n(\theta) \pi(\theta) ,d\theta ]
-
the marginal posterior density for
$\theta_i$ [ \pi(\theta_i ,|, \mathcal{D}n) = \int\int\cdots\int \pi(\theta_1, \cdots, \theta_d) ,d\theta_1 \cdots d\theta{i-1} d\theta_{i+1} \cdots d\theta_s ]
-
-
Fundamentals of deterministic numerical methods
-
computing integrals usually depending on some adaptation of the quadratic rules
-
goal: computing the integral
[ I = \int_a^b h(\theta) d \theta ]
-
task: finding suitable weights
$w_i$ 's and ordered knot points$\theta_j$ 's$\to$ [ I = \int_a^b h(\theta)d\theta \approx I_N = \sum_{j=1}^N w_j h(\theta_j) \tag{12} ]
-
$\exists; N \to \infty: |I - I_n| \to 0$ -
different choices of the weights and the knot points
$\to$ different integration rules -
categories of the numerical quadrature rules
- Newton-Cotes rules: based on equally spaced knot points
- Gaussian quadrature rules
- based on knot points obtained via the roots of the orthogonal polynomials
- more accurate results, especially
$h(\cdot)$ approximates reasonably well by a sequence of polynomials
-
-
- simplest approaches
- trapezoidal rule: a two-point rule
- mid-point rule: another two point rule
- Simpson's rule: a three-point rule
- general rule:
- a
$(2k - 1)$ -point (or$2k$ -point) rule w/$N$ equally spaced knots w/ suitably chosen weights $\exists; \theta^\ast \in (a, b) \to |I - I_n| = C_k (b - a)^{2k+1} b^{2k}(\theta^\ast) / N^{2k}$ -
$C_K > 0$ : a constant
- a
- exact result: polynomials of degree
$N$ ($N = 2k+1$ ) or degree$(N-1)$ ($N=2k$ )$\to$ N knots
- simplest approaches
-
Gaussian quadrature rules
- not only suitably choosing the weights but also the knots at which the function
$h(\cdot)$ evaluated - exact result: polynomials of degree
$2N -1$ or less$\to N$ knots
- not only suitably choosing the weights but also the knots at which the function
-
Multi-dimensional numerical integration rules
-
multidimensional trapezoidal rule
[ I = I_N + O(N^{2/m}) ]
-
multidimensional Simpson's rule:
$O(N^{-4/m})$ -
Monte Carlo methods:
$O_p(N^{-1/2})$ -
accuracy of integration:
$m \nearrow ;;\to$ accuracy$\searrow$
-
-
[Summary of deterministic numerical integration methods ] (../Notes/p03-BayesianBasics.md#51-deterministic-methods)
- very accurate but declined rapidly as
$m \nearrow$ - invalid theoretical error estimate
- a complicated parameter space
- insufficiently smooth function
- very accurate but declined rapidly as
-
-
using another density to generate samples and then useing a weighted sample mean to approximate the posterior mean instead of
$K(\theta; Y)$ [ \int \mathfrak{g}(\theta) K(\theta; Y) d\theta = \int \mathfrak{g}(\theta) \frac{K(\theta; Y)}{q(\theta)} q(\theta) d\theta = \int \mathfrak{g}(\theta) w(\theta) q(\theta) d\theta \tag{15} ]
-
$q(\theta)$ : the importance proposal density$\to { \theta \in \Theta: K(\theta; Y) > 0 } \subseteq { \theta \in \Theta: q(\theta) > 0 }$ -
$w(\theta)$ : the importance weight function
-
-
algorithm to estimate
$I = I(Y)$ - generate
$\theta^{(l)} \stackrel{iid}{\sim} q(\theta)$ for$l = 1, 2, \cdots, N$ - compute the importance weights
$w^{(l)} = w(\theta^{(l)})$ for$l = 1, 2, \cdots, N$ - compute $\overline{I}N = \frac{1}{N} \sum{l=1}^N \mathfrak{g}(\theta^{(l)}w^{(l)})$
- generate
-
estimating the posterior mean
[ E[\mathfrak{g}\theta)|Y] \approx \overline{\mathfrak{g}}N = \frac{\sum{l=1}^N \mathfrak{g}(\theta^{(l)} w^{(l)})}{\sum_{l=1}^N w^{(l)}} \implies \frac{\sum_{l=1}^N w^{(l)}}{N} \xrightarrow[\text{in probability}]{\text{converge}} \int K(\theta; Y) d\theta ]
-
-
Modeling of importance sampling
-
$\exists; g$ a probability density w/ known distribution -
the integral
[ I = \int h(x) f(x) dx = \int \frac{h(x) f(x)}{g(x)} g(x) dx = \mathbb{E}_g (W) ]
$W = h(X)f(X)/g(X)$ -
$\mathbb{E}_g(W)$ : the expectation w.r.t$g$
-
importance sampling:
-
simulate
$X_1, \dots, X_N \sim g$ -
estimate
$I$ by the sample average[ \widehat{I} = \frac{1}{N} \sum_{i=1}^N Y_i = \frac{1}{N} \sum_{i=1}^N \frac{h(X_i)f(X_i)}{g(X_i)} ]
-
-
by the law of large number:
$\widehat{I} \xrightarrow{P} I$
-
-
Guideline to importance sampling
-
summary: a good choice for an importance sampling density
$g$ should be similar to$f$ but w/ thicker tails -
Theorem: the choice of
$g$ that minimizes the variance of$\widehat{I}$ is[ g^*(x) = \frac{|h(x)| f(x)}{\int |h(s)| f(s) ds} ]
-
-
Modeling importance sampling in Bayesian inference
-
$\theta_1, \dots, \theta_N$ as a sample from$g$ , the posterior mean[ \mathbb{E}[\theta ,|, X_1, \dots, X_n] \approx \frac{\frac{1}{N} \sum_{i=1}^N h_1(\theta_i)}{\frac{1}{N} \sum_{i=1}^N h_2(\theta_i)} ]
-
very difficult to choose a good importance sampler
$g$ , especially in high dimensions
-
-
- a toolkit of techniques aiming on evaluating integrals or sums by simulation rather than exact or approximate algebraic analysis
- a.k.a. probabilistic sensitivity analysis
- the simulated quantities passed into a standard spreadsheet and the resulting distributions of the outputs of the spreadsheet reflecting the uncertainty about the inputs
- used for Bayesian analysis provided the prior/current posterior distribution of concern is a member of a known family
- conjugate Bayesian analysis: possible to derive such a posterior distribution algebraically
- used to find tail areas or more usefully to find the distribution of complex functions of one or more unknown quantities as in the probabilistic sensitivity analysis
-
-
used in statistics and various fields to solve various problems
-
procedure
- generating pseudo-random numbers
- observing that sample analogue of the numbers converges to their population versions
-
useful for obtaining numerical solutions to problems which are too complicated to solve analytically or by using deterministic methods
-
relying on two celebrated results in Statistics
- The (Strong or Weak) Law of Large Numbers (LLN)
- The Central Limit Theorem (CLT)
-
generating
$\theta^{(l)} \stackrel{iid}{\sim} p(\theta | Y), ;; l=1,2,\cdots,N$ by using only$K(\theta; Y)$ -
w/ (Strong/Weak) LLN: as
$N \to \infty$ [ \overline{\mathfrak{g}}N = \frac{1}{N} \sum{l=1}^N \mathfrak{g}(\theta^{(l)}) \xrightarrow{p} E[\mathfrak{g}(\theta) | Y] = \int \mathfrak{g}(\theta) p(\theta | Y) d\theta \tag{13} ]
-
$N \nearrow \implies \overline{\mathfrak{g}}_N \to E[\mathfrak{g}(\theta) | Y]$ ;$\overline{\mathfrak{g}}_N$ = sample mean,$E[\mathfrak{g}(\theta)$ : population mean -
almost no smoothness condition required on the function
$\mathfrak{g}(\cdot)$ to apply the MC method
-
-
-
applying CLT to determine the least approximate value of
$N$ -
as
$N \to \infty$ [ \sqrt{N} (\overline{\mathfrak{g}}N - E[\mathfrak{g}(\theta) | Y]) \sim N\left(0, \sum{\mathfrak{g}}\right) \tag{14} ]
-
$\overline{\mathfrak{g}}_N = E[\mathfrak{g}(\theta) | Y] + O_p(N^{-1/2})$ and (stochastic) error bound: not depend on the dimension$m$ of$\theta$ -
Summary: an accuracy of
$\epsilon > 0$ with 95% CI$\implies$ [ N \geq \frac{4\hat{\sigma}_{\mathfrak{g}}^2}{\epsilon^2} ]
-
-
- error bound of integration methods:
- MC integration: probabilistic
- deterministic integration: fixed
- MC methods more advanced
- non-smooth
$\mathfrak{g}(\cdot)$ - complicated parameter space
- non-smooth
- error bound of integration methods:
-
-
$\exists, \text{ a function } h$ , to evaluate the integral[ I = \int_a^b h(x) dx ]
-
numerical techniques for evaluating
$I$ - Simpson's rule
- the trapezoidal rule
- Gaussian quadrature
- Monte Carlo integration
-
-
Basic Monte Carlo integration method
-
approximating
$I$ which is notable for its simplicity, generality and scalability -
$\exists; w(x) - h(x) (b-a), f(x) = 1/(b-a) \ni$ [ I = \int_a^b h(x) dx = \int_a^b w(x) f(x) dx ]
-
generating
$X_1, \dots, X_N \sim \text{Uniform}(a, b) \implies$ the the law of large numbers[ \widehat{I} \equiv \frac{1}{N} \sum_{i=1}^N w(X_i) \xrightarrow{P} \mathbb{E}(w(X)) = I ]
-
the standard error of the estimate
[ \widehat{se} = \frac{s}{\sqrt{N}}, \qquad s^2 = \frac{\sum_{i=1}^N (\widehat{Y}_i - \widehat{I})^2}{N - 1} ;;\text{ and };; Y_i = w(X_i) ]
-
region estimate:
$(1 - \alpha)$ confidence interval =$\widehat{I} \pm z_{\alpha/2} \cdot \widehat{se}$ -
$N \nearrow \implies CI \searrow$
-
-
Issue on basic Monte Carlo method
- issue: no guarantee
$\pi(\theta | \mathcal{D}_n)$ w/ a known distribution - solution: importance sampling - a generalization of basic Monte Carlo
- issue: no guarantee
-
Generalized Monte Carlo integration methods
-
the integral w/
$f(x)$ a probability density function[ I = \int_a^b h(x) f(x) dx ]
-
special case:
$f \sim \text{Uniform}(a, b)$ -
drawing
$X_1, \dots, X_N \sim f$ [ \widehat{I} := \frac{1}{N} \sum_{i=1}^N h(X_i) ]
-
-
Numerical methods for Gaussian density
-
the standard normal PDF
[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} ]
-
computing the CDF at some point
$x$ [ I = \int_{-\infty}^x f(s) ds = \Phi(x) \implies I = \int h(s) f(s) ds, \quad h(s) = \begin{cases} 1 & s < x \ 0 & s \geq x \end{cases} ]
-
generating
$X_1, \dots, X_N \sim N(0, 1)$ and[ \widehat{I} = \frac{1}{N} \sum_i h(X_i) = \frac{\text{number of observations w/ their values} \leq x}{N} ]
-
-
Bayesian inference for two binomial
- freqentist analysis
-
$X \sim \text{Binomial}(n, p_1), ,Y \sim \text{Binomial}(m, p_2)$ -
task: to estimate
$\delta = p_2 - p_1$ -
the MLE:
$\widehat{\delta} = \widehat{p}_1 - \widehat{p}_2 = (Y/m) - (X/n)$ -
the standard error
$\widehat{se}$ using the delta method[ \widehat{se} = \sqrt{\frac{\widehat{p}_1 (1 - \widehat{p}_1)}{n} + \frac{\widehat{p}_2 (1 - \widehat{p}_2)}{m}} ]
-
95% confidence interval:
$\widehat{\delta} \pm 2 \cdot \widehat{se}$
-
- Bayesian analysis
-
the prior:
$\pi(p_1, p_2) = \pi(p_1) \pi(p_2) = 1 \implies$ a flat prior on$(p_1, p_2)$ -
the posterior
[ \pi(p_1, p_2 ,|, X, Y) \propto p_1^X(1 - p_1)^{n - X} p_2^Y (1 - p_2)^{m-Y} ]
-
the posterior mean of
$\delta$ [ \overline{\delta} = \int_0^1\int_0^1 \delta(p_1, p_2) \pi(p_1, p_2 ,|, X, Y),dp_1 dp_2 = \int_0^1\int_0^1 (p_2 - p_1) \pi(p_1, p_2 ,|, X, Y) ,dp_1 dp_2 ]
-
obtaining the posterior CDF w/
$A = { (p_1, p_2): p_2 - p_1 \leq c }$ [ F(c ,|, X, Y) = \mathbb{P}(\delta \leq c ,|, X, Y) = \int_A \pi(p_1, p_2 ,|, X, Y) ]
-
then differentiate
$F \to$ too complicated
-
- simulation approach
-
$\pi(p_1, p_2 ,|, X, Y) = \pi(p_1 ,|, X) \pi(p_2 ,|, Y) \implies p_1, p_2$ independent under the posterior distribution -
simulate
$(P_1^{(1)}, P_2^{(1)}), \dots, (P_1^{(N)}, P_2^{(N)})$ from the posterior by drawing ($i = 1, \dots, N$ )[\begin{align*} P_1^{(i)} \sim \text{Beta}(X + 1, n - X + 1) \ P_2^{(i)} \sim \text{Beta}(Y + 1, n - Y + 1) \end{align*}]
-
$\delta^{(i)} = P_2^{(i)} - P_1{(i)} \implies$ [ \overline{\delta} \approx \frac{1}{N} \sum_{i=1}^N \delta^{(i)} ]
-
95% posterior interval for
$\delta$ :- sorting the simulated values
- finding the .025 and .975 quantile
-
the posterior density
$f(\delta ,|, X, Y)$ - applying density estimation techniques to
$\delta^{(1)}, \dots, \delta^{(N)}$ - plotting a histogram
- applying density estimation techniques to
-
- freqentist analysis
- Bayesian inference for multiple parameters
-
task: to estimate the dose at which the animals w/ 50% chance of dying, LD50
[ \delta = x_j^* \qquad j^* = \min{ j: p_j \geq 1/2 } ]
-
$\delta$ implicitly just a complicated function of$p_1, \dots, p_{10} \to \delta = g(p_1, \dots, p_{10})$ -
known
$(p_1, \dots, p_{10}) \implies$ find$\delta$ -
the posterior mean of
$\delta$ w/$A = { (p_1, \dots, p_{10}): p_1 \leq \cdots \leq p_{10}) }$ [ \int\int\cdots\int_A g(p_1, \dots, p_{10})\pi(p_1, \dots, p_{10} ,|, Y_1, \dots, Y_{10}) ,dp_1 dp_2 \cdots dp_{10} ]
-
the posterior CDF of
$\delta$ w/$B = A \cup { (p_1, \dots, p_{10}): g(p_1, \dots, p_{10}) \leq c }$ [\begin{align*} F(c ,|, Y_1, \dots, Y_{10}) &= \mathbb{P}(\delta \leq c ,|, Y_1, \dots, Y_{10}) \ &= \int\int \cdots\int_B \pi(p_1, \cdots, p_{10} ,|, Y_1, \dots, Y_{10}) dp_1 dp_2 \cdots dp_{10} \end{align*}]
-
simulation procedure
-
taking a flat prior truncated over
$A$ -
each
$P_i \sim$ Beta distribution -
drawing from the posterior
- drawing
$P_i \sim \text{Beta}(Y_i + 1, ,n - Y_i +1), ;i = 1, \dots, 10$ - if
$P_1 \leq P_2 \leq \dots \leq P_{10}$ keeping this draw. Otherwise, throw it away and draw again until getting one to keep - let
$\delta = x_j^\ast$ w/$j^\ast = \min { j: P_j > 1/2 }$
- drawing
-
repeat
$N$ times to get$\delta{(1)}, \dots, \delta^{(N)}$ and take[ \mathbb{E}(\delta ,|, Y_1, \dots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^N \delta^{(i)} ]
-
estimate the probability mass function as
$\delta$ a discrete variable[ \mathbb{P}(\delta = x_j ,|, Y_1, \dots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^N I(\delta^{(i)} = x_j) ]
-
-
-
Non-standard distributions and Bayesian approach
- non-conjugate distribution or nuisance parameters
- more complex Bayesian analysis
- not possible to derive the posterior distribution in an algebraic form
- Markov chain Monte Carlo methods: developed as a remarkably effective means of sampling from the posterior distribution of interest w/o knowing its algebraic form
- non-conjugate distribution or nuisance parameters
-
Essential components of MCMC methods
- replacing analytic methods by simulation
- sampling from the posterior distribution
- starting the simulation
- checking convergence
-
Markov Chain Monte Carlo (MCMC) methods
- to resolve the choice of
$q(\theta)$ , in particular, high dimensional$\theta$ - used to generate (dependent) samples from the posterior distribution, only using the posterior kernel
$K(\theta; Y)$ - the posterior distribution: stationary distribution of
${\theta^{(l)}: l = 1, 2, \dots}$ - creating a suitable transition kernel of a Markov Chain
- the Metropolis algorithm: most popular approaches
- the Metropolis-Hasting algorithm: generalized version
- generated samples
- MCMC method: dependent
- MC methods: independent sample
- MCMC samples
- not directly obtained from the posterior distribution
- obtained in an asymptotic sense
$\to$ discarding the first thousand samples ${\theta^{(l)}: l = 1, 2, \dots} \to {\theta^{(l)}: l = B+1, B+2, \dots, B > 1}$
- estimating
$\eta = \eta(\theta)$ based on samples$\theta^{(l)} \implies$ computing$\overline{\eta} = \sum_{l=B+1}^N \eta^{(l)} / (N-B) \approx E[\eta|Y], \exists; N >> B >> 1$
- to resolve the choice of
-
Markov chain Monte Carlo methods
-
constructing a Markov chain
$X_1, X_2, \dots$ w/$f$ as their stationary distribution -
problem: estimating the integral
$I = \int h(x) f(x) dx$ -
under certain conditions, the integral following
[ \frac{1}{N} \sum_{i=1}^N h(X_i) \xrightarrow{P} \mathbb{E}_f(h(X)) = I ]
-
-
Interpretation of Markov chain
- a Markov chain w/ the transition kernel
$p(y|x)$ = the probability of transiting from$x$ to$y$ -
$f(x) = \int f(y) p(x|y) dy \implies f$ is a stationary distribution for the Markov chain -
$f$ as a stationary distribution for a Markov chain$\implies$ the data from a sample run of the Markov chain approximating the distribution$f$ - designing a Markov chain w/ stationary distribution
$f \implies$ running the Markov chain and using the resulting data as if it were a sample from$f$
- a Markov chain w/ the transition kernel
-
Metropolis-Hastings algorithm
-
algorithm
choose$X_0$ arbitraryGiven$X_0, X_1, \dots, X_i$ , generate$X_{i+1}$ as following:1. generate a proposal or candidate value$Y \sim q(y | X_i)$ 2. Evaluate$r \equiv r(X_i, Y)$ where$r(x, y) = \min \{ \frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}, \; 1 \}$ 3. Set$X_{i+1} = \begin{cases} Y & \text{with probability } r \\ X_i & \text{with probability } 1 - r \end{cases}$
-
simple way to execute step 3: generate
$U \sim \text{Uniform}(0, 1)$ -
common choice:
-
$q(Y|x) \sim N(x, b^2), ,b >0 \ni$ the proposal drawing from a normal, centered at the current value -
the proposal density
$q$ symmetric$\implies q(y|x) = q(x|y)$ and[ r = \min\left{ \frac\left{f(Y)}{f(X_i)}, ,1 \right} ]
-
-
constructed
$X_0, X_1, \dots \to$ Markov chain -
the chain mixing well: the sample from the Markov chain starts look like the target distribution
$f$ quickly- tuning parameter (
$b$ )$\implies$ the efficiency and mixing-well of the chain
- tuning parameter (
-
-
Random-Walk-Metropolis-Hastings method
-
drawing a proposal
$Y$ of the form[ Y = X_i + \epsilon_i \implies q(y|x) = g(y-x) ]
-
the acceptance probability
[ r(x, y) = \min \left{ 1, \frac{f(y)}{f(x)} \right} ]
-
common choice:
$g \sim N(0, b^2)$ -
issue: choosing
$b$ to mix well for the Markov chain -
rule of thumb: choosing
$b \ni$ about 50% of the proposal are accepted
-
-
Independence-Metropolis-Hastings method
-
$g$ chosen to be an approximation to$f$ -
the acceptance probability
[ r(x, y) = \min \left{ 1, \frac{f(y)}{f(x)} \frac{g(x)}{f(y)} \right} \ \min \left{ 1, \frac{f(y)}{g(y)} \frac{g(x)}{f(x)} \right} ]
-
-
- a way to turn a high-dimensional problem into several one-dimensional problems
- a bivariate problem:
$(X, Y)$ w/ density$f_{X, Y}(x, y)$ -
the acceptance probability for the Metropolis-Hastings algorithm
[ r((X_n, Y_n), (X_n, Y)) = \min \left{ 1, \frac{f(X_n, Y)}{f(X_n, Y_n)} \frac{f_{Y|X}(Y_n|X_n)}{f_{Y|X}(Y|X_n)} \right} ]
-
iterating until convergence
[\begin{align*} X_{n+1} &\sim f_{X|Y}(x ,|, Y_n) \ Y_{n+1} &\sim f_{Y|X}(y ,|, X_{n+1}) \end{align*}]
-
-
Normal hierarchical model w/ Gibbs sampling
-
making some transformations to use some normal approximations
$\widehat{p}_i = Y_i / n_i \approx N(p_i, s_i), \quad s_i = \sqrt{\widehat{p}_i(1 - \widehat{p}_i)/n_i} $ $\psi_i = \log\left(p_i/(1-p_i)\right), \quad Z \equiv \widehat{\psi}_i = \log(\widehat{p}_i / (1 - \widehat{p}_i))$
-
by delta method
[ \widehat{\psi}_i \approx N(\psi_i, \sigma_i^2), \qquad \sigma_i^2 = \frac{1}{n\widehat{p}_i (1 - \widehat{p}_i)} ]
-
the normal approximation for
$\psi$ more accurate than the normal approximation for$p$ -
the hierarchical model
[ \psi_i \sim N(\mu, \tau^2), \qquad Z_i ,|, \psi_i \sim N(\psi_i, ,\sigma_i^2) ]
-
the likelihood function
[ \mathcal{L}_n(\theta) \propto \prod_i \exp\left( -\frac{1}{2}(\psi_i - \mu)^2 \right) \exp\left( -\frac{1}{2\sigma_i^2}(Z_i - \psi_i)^2 \right) ]
-
the prior
$f(\mu) \propto 1 \implies$ the posterior proportional to the likelihood -
using Gibbs sampling to find the conditional distribution of each parameter conditional on all the others,
$f(\mu ,|, \text{rest})$ - ignoring the terms not related to
$\mu$ :$\mu ,|, \text{rest} \sim N(b, ,1/k)$
- ignoring the terms not related to
-
fining
$f(\psi ,|, \text{rest})$ by ignoring any terms irrelevant to$\psi$ :$\psi_i ,|, \text{rest} \sim N(e_i, ,d_i^2)$ -
the Gibbs sampling algorithm involving iterating the following steps
$N$ times[\begin{array}{rcl} \text{draw } \mu & \sim & N(b, ,v^2)\ \text{draw } \psi_1 & \sim & N(e_1, ,d_1^2) \ & \vdots & \ \text{draw } \psi_k & \sim & N(e_k, ,d_k^2) \end{array}]
-
at each step, the most recently drawn version of each variable used
-
-
Gibbs sampling w/ Metropolis-Hastings
-
not knowing how to draw samples
$\to$ using the Gibbs sampling algorithm by drawing each observation using a Metropolis-Hastings step -
algorithm
(1a) draw a proposal$Z \sim q(z \,|\, X_n)$ (1b) Evaluate[ r = \min\left{ \frac{f(Z, Y_n)}{f(X_n, Y_n)} \frac{q(X_n | Z)}{q(Z | X_n)}, ;1 \right} ]
(1c) Set[ X_{n+1} = \begin{cases} Z & \text{with probability } r \ X_n & \text{with probability } 1- r \end{cases} ]
(2a) Draw a proposal$Z \sim \tilde{q}(z \,|\, Y_n)$ (2b) Evaluate[ r = \min \left{ \frac{f(X_{n+1}, Z)}{f(X_{n+1}, Y_n)} \frac{\tilde{q}(Y_n | Z)}{\tilde{q}(Z | Y_n)}, ; 1 \right} ]
(2c) Set[ Y_{n+1} = \begin{cases} Z & \text{with probability } r \ Y_n & \text{text probability } 1 - r \end{cases} ]
-
step (1) (and similar step (2)), w/
$Y_n$ fixed, sampling from$f(Z | Y_n)$ equivalent to sampling from$f(Z, Y_n)$ , as the ratios are identical[ \frac{f(Z, Y_n)}{f(X_n, Y_n)} = \frac{f(Z | Y_n)}{f(X_n | Y_n)} ]
-