
Statistics: Numerical Approaches

Overview

  • Computation and Bayesian approach
    • full probability modeling
      • a.k.a. the Bayesian approach
      • applying probability theory to a model derived from substantive knowledge
      • able to deal w/ realistically complex situations
    • computational difficulties:
      the specific problem being to carry out the integrations necessary to obtain the posterior distributions of quantities of interest when
      • non-standard prior distributions are used
      • additional nuisance parameters appear in the model
    • solution: Markov Chain Monte Carlo (MCMC) methods

Numerical Methods

  • Un-pure Bayesian method

    • issue: difficult to completely specify the prior density based on available expert opinion or background scientific information
    • utilizing hyperparameters to resolve this
      • $\{p_1(\theta|\lambda): \lambda \in \Lambda\}$: a class of prior densities
      • $\lambda$: hyper-parameter
      • $p_1(\cdot | \cdot)$: a conditional density of $\theta$ given $\lambda$
    • methods to determine hyperparameter $\lambda \to$ empirical estimate $\hat{\lambda}$
      • Bayesian Hierarchical Model (BHM)
      • Empirical Bayes (EB) method
    • using $p(\theta) = p_1(\theta | \hat{\lambda})$ as the prior for $\theta$
    • criticism: the prior density $p(\theta)$ is obtained via the data $Y$
  • Numerical methods

    • computing posterior summary estimates
    • binomial sampling density:
      • Zellner's prior: $p(\theta) = C\theta^\theta (1-\theta)^{1-\theta}$ for $\theta \in (0, 1)$ and C = 1.61857
      • Jeffreys' prior: $Beta(a = 0.5, b = 0.5)$
    • Haldane's prior: improper prior $p(\theta) = [\theta(1-\theta)]^{-1}$
    • uniform prior: same as $Beta(a=1, b=1)$ w/ $\theta$ using binomial sampling density
    • posterior inference about $\theta$, the odds ratio $\rho = \theta/(1-\theta)$, or the log-odds ratio $\eta = \log \rho$ is relatively insensitive to all of the previous priors
  • Simulation methods for posterior distribution

    • general procedure
      • drawing $\theta^1, \dots, \theta^B \sim p(\theta \,|\, \mathcal{D}_n)$
      • generating a histogram of $\theta^1, \dots, \theta^B$ approximates the posterior density $p(\theta \,|\, \mathcal{D}_n)$
    • approximation to the posterior mean: $\overline{\theta}_n = \mathbb{E}(\theta \,|\, \mathcal{D}_n) \approx B^{-1}\sum_{j=1}^B \theta^j$
    • approximation to posterior $(1-\alpha)$ interval: $(\theta_{\alpha/2}, \theta_{1 - \alpha/2})$ w/ $\theta_{\alpha/2}$ as the $\alpha/2$ sample quantile of $\theta^1, \dots, \theta^B$
    • $\exists\, \theta^1, \dots, \theta^B$ from $p(\theta \,|\, \mathcal{D}_n)$ and $\tau^i = g(\theta^i) \,\forall\, i=1, \dots, B \implies \tau^1, \dots, \tau^B$ is a sample from $p(\tau \,|\, \mathcal{D}_n)$
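    A minimal sketch of this simulation procedure, assuming (purely for illustration) a $Beta(a, b)$ posterior, e.g. from binomial data w/ a conjugate Beta prior; the transformation to the log-odds follows the last bullet above:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # assumed posterior: Beta(a, b), e.g. a = y + 1, b = n - y + 1 under a uniform prior
    a, b, B = 8.0, 15.0, 10_000
    theta = rng.beta(a, b, size=B)               # theta^1, ..., theta^B ~ p(theta | D_n)

    post_mean = theta.mean()                     # approximates E(theta | D_n)
    lo, hi = np.quantile(theta, [0.025, 0.975])  # approximate 95% posterior interval

    # transformed parameter: eta^i = g(theta^i) is a sample from p(eta | D_n)
    eta = np.log(theta / (1 - theta))            # log-odds
    print(post_mean, (lo, hi), eta.mean())
    ```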
  • Methods for obtaining simulated values from the posterior

    • stochastic simulation methods
      • based on random sampling
      • typical approaches
        • Monte Carlo integration
        • importance sampling
        • Markov chain Monte Carlo (MCMC)
    • variational inference
      • based on deterministic approximation and numerical optimization
      • applied to a wide range of problems
      • very weak theoretical guarantees
  • Simulation approach for distribution

    • drawing sample $X$ from a distribution $F$
      • $F(X)$: uniform distribution over the interval (0, 1)
      • a basic strategy to sample $U \sim \text{Uniform}(0, 1) \to X = F^{-1}(U)$
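    A small sketch of this inverse-CDF strategy, using an Exponential($\lambda$) target as an illustrative choice of $F$, since $F^{-1}(u) = -\log(1-u)/\lambda$ is available in closed form:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    lam = 2.0                                 # rate of the Exponential(lam) target F
    u = rng.uniform(0.0, 1.0, size=100_000)   # U ~ Uniform(0, 1)
    x = -np.log(1.0 - u) / lam                # X = F^{-1}(U) ~ Exponential(lam)

    print(x.mean())                           # should be close to 1/lam = 0.5
    ```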

Numerical Methods for Integrals

  • Integrals for posterior inference

    • issue: almost any posterior inference based on $K(\theta; Y) \to$ high-dimensional integration w/ $\Theta \subseteq \mathbb{R}^m$, the $m$-dimensional Euclidean space

    • for a generic function $\mathfrak{g}(\theta)$ of the parameter $\theta$, posterior inference requires the computation of the integral

      $$ I = I(Y) = \int \mathfrak{g}(\theta) K(\theta; Y)\, d\theta \tag{11} $$

  • Numerical methods for integrals

    • numerical approaches
      • classical (deterministic) numerical methods
        • applied when $m < 5$
        • order of accuracy: $O(N^{-2/m})$
        • rate of convergence depending on $m$ (curse of dimensionality)
      • stochastic integration methods
        • applied when $m \geq 5$
        • known as Monte Carlo (MC) methods
        • order of accuracy: $O(N^{-1/2})$
        • rate of convergence not depending on $m$
    • performance (for small $m$): deterministic >> stochastic
  • Basic simulation approach to estimate integral

    • estimating the integral $\int_0^1 h(x) \,dx$ for some complicated function $h$

    • drawing $N$ samples $X_i \sim \text{Uniform}(0, 1)$

    • estimating the integral

      $$ \int_0^1 h(x) \,dx \approx \frac{1}{N} \sum_{i=1}^N h(X_i) $$

    • converges to the desired integral by the law of large numbers
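    A minimal sketch of this estimator, w/ $h(x) = e^{-x^2}$ chosen purely as an illustrative integrand:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    h = lambda x: np.exp(-x**2)          # illustrative "complicated" integrand
    N = 100_000
    x = rng.uniform(0.0, 1.0, size=N)    # X_i ~ Uniform(0, 1)
    estimate = h(x).mean()               # (1/N) sum h(X_i) -> integral, by the LLN

    print(estimate)                      # close to 0.7468...
    ```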

Bayesian Hierarchical Model

  • Bayesian hierarchical model (BHM)
    • assigning another prior distribution for $\lambda$:

      • $\lambda \sim p_2(\lambda)$
      • $p_2(\cdot)$: a probability density defined over $\Lambda$
    • $\therefore\, p(\theta) = \int p_1(\theta|\lambda) p_2(\lambda)\,d\lambda$: a more flexible prior for $\theta$

    • forming the hierarchical stages of the complete Bayes model

      $$ Y|\theta, \lambda \sim p(Y|\theta), \quad \theta|\lambda \sim p_1(\theta|\lambda), \quad \lambda \sim p_2(\lambda) $$

    • often choosing $p_2(\cdot)$ suitable to obtain estimators w/ good frequentist properties

    • issue: inference based on $p(\theta|Y)$ requiring integration both on $\theta$ and $\lambda$

    • solutions: Monte Carlo methods (in particular, Markov Chain Monte Carlo (MCMC) methods) used to obtain samples from the posterior density $p(\theta|Y)$

Empirical Bayes Method

  • Empirical Bayes (EB) method

    • based on marginal likelihood $m(Y|\lambda) = \int p(Y|\theta) p_1(\theta|\lambda) d\theta$

    • $m(Y|\lambda)$: the marginal likelihood function of $\lambda \implies$ maximizing

      $$ \hat{\lambda} = \hat{\lambda}(Y) = \underset{\lambda \in \Lambda}{\mathrm{argmax}}\; m(Y|\lambda) $$

    • alternatively, obtaining a moment based method to 'estimate' $\lambda$

      • $\lambda$: a $q$-dimensional parameter
      • using a set of $q$ suitable moments of $Y$ w.r.t. the marginal density $m(Y|\lambda)$
      • equate moments to the corresponding $q$ empirical moments of the data vector $Y = (y_1, y_2, \dots, y_n)$
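    A small sketch of the EB idea for a normal-normal model, an assumption made here only for concreteness: $Y_i \mid \theta_i \sim N(\theta_i, \sigma^2)$ w/ $\sigma^2$ known and $\theta_i \mid \lambda \sim N(\mu, \tau^2)$, $\lambda = (\mu, \tau^2)$; marginally $Y_i \sim N(\mu, \sigma^2 + \tau^2)$, so maximizing $m(Y|\lambda)$ (equivalently, matching two moments) gives $\hat{\mu} = \bar{Y}$ and $\hat{\tau}^2 = \max(0, s^2 - \sigma^2)$:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # simulate data from the assumed hierarchical model (sigma^2 known)
    sigma2, mu_true, tau2_true, n = 1.0, 2.0, 0.5, 200
    theta = rng.normal(mu_true, np.sqrt(tau2_true), size=n)
    y = rng.normal(theta, np.sqrt(sigma2))

    # EB estimates of the hyperparameters from the marginal m(Y | lambda) = N(mu, sigma^2 + tau^2)
    mu_hat = y.mean()
    tau2_hat = max(0.0, y.var(ddof=1) - sigma2)

    # plug-in prior p(theta) = p_1(theta | lambda_hat): posterior means shrink y toward mu_hat
    shrink = tau2_hat / (tau2_hat + sigma2)
    theta_post_mean = mu_hat + shrink * (y - mu_hat)

    print(mu_hat, tau2_hat)
    ```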
  • Simulation methods for Bayesian inference

    • the posterior density

      $$ \pi(\theta \,|\, \mathcal{D}_n) = \frac{\mathcal{L}_n(\theta)\pi(\theta)}{c} $$

      • $\mathcal{L}_n(\theta)$: the likelihood function
      • the normalizing constant $c = \int \mathcal{L}_n(\theta) \pi(\theta) \,d\theta$
    • the posterior mean

      $$ \overline{\theta} = \int \theta\, \pi(\theta \,|\, \mathcal{D}_n) \,d\theta = \frac{1}{c} \int \theta\,\mathcal{L}_n(\theta) \pi(\theta) \,d\theta $$

    • the marginal posterior density for $\theta_i$

      $$ \pi(\theta_i \,|\, \mathcal{D}_n) = \int\int\cdots\int \pi(\theta_1, \cdots, \theta_d \,|\, \mathcal{D}_n) \,d\theta_1 \cdots d\theta_{i-1}\, d\theta_{i+1} \cdots d\theta_d $$

Deterministic Methods

  • Fundamentals of deterministic numerical methods

    • computing integrals usually depends on some adaptation of the quadrature rules

    • goal: computing the integral

      $$ I = \int_a^b h(\theta)\, d\theta $$

    • task: finding suitable weights $w_i$'s and ordered knot points $\theta_j$'s $\to$

      $$ I = \int_a^b h(\theta)\,d\theta \approx I_N = \sum_{j=1}^N w_j h(\theta_j) \tag{12} $$

    • as $N \to \infty$, $|I - I_N| \to 0$

    • different choices of the weights and the knot points $\to$ different integration rules

    • categories of the numerical quadrature rules

      • Newton-Cotes rules: based on equally spaced knot points
      • Gaussian quadrature rules
        • based on knot points obtained via the roots of the orthogonal polynomials
        • more accurate results, especially when $h(\cdot)$ is approximated reasonably well by a sequence of polynomials
  • Newton-Cotes rules

    • simplest approaches
      • trapezoidal rule: a two-point rule
      • mid-point rule: another two point rule
      • Simpson's rule: a three-point rule
    • general rule:
      • a $(2k - 1)$-point (or $2k$-point) rule w/ $N$ equally spaced knots w/ suitably chosen weights
      • $\exists\, \theta^\ast \in (a, b) \to |I - I_N| = C_k (b - a)^{2k+1} h^{(2k)}(\theta^\ast) / N^{2k}$
      • $C_k > 0$: a constant
    • exact result: polynomials of degree $N$ ($N = 2k+1$) or degree $(N-1)$ ($N=2k$) $\to$ N knots
  • Gaussian quadrature rules

    • not only suitably choosing the weights but also the knots at which the function $h(\cdot)$ evaluated
    • exact result: polynomials of degree $2N -1$ or less $\to N$ knots
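    A brief sketch comparing an $N$-knot Gauss-Legendre rule w/ plain Monte Carlo on the illustrative integral $\int_0^1 e^{-x^2}\,dx$; `numpy.polynomial.legendre.leggauss` returns knots and weights on $[-1, 1]$, which are rescaled to $[a, b]$:

    ```python
    import numpy as np

    h = lambda x: np.exp(-x**2)
    a, b, N = 0.0, 1.0, 8

    # Gauss-Legendre: N knots are exact for polynomials of degree 2N - 1
    nodes, weights = np.polynomial.legendre.leggauss(N)   # knots/weights on [-1, 1]
    x = 0.5 * (b - a) * nodes + 0.5 * (a + b)             # rescale knots to [a, b]
    w = 0.5 * (b - a) * weights
    I_gauss = np.sum(w * h(x))

    # plain Monte Carlo with far more evaluations is less accurate here (m = 1, smooth h)
    rng = np.random.default_rng(4)
    I_mc = (b - a) * h(rng.uniform(a, b, size=10_000)).mean()

    print(I_gauss, I_mc)   # both near 0.746824...
    ```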
  • Multi-dimensional numerical integration rules

    • multidimensional trapezoidal rule

      $$ I = I_N + O(N^{-2/m}) $$

    • multidimensional Simpson's rule: $O(N^{-4/m})$

    • Monte Carlo methods: $O_p(N^{-1/2})$

    • accuracy of integration: $m \nearrow \;\;\to$ accuracy $\searrow$

  • [Summary of deterministic numerical integration methods](../Notes/p03-BayesianBasics.md#51-deterministic-methods)

    • very accurate, but the accuracy declines rapidly as $m \nearrow$
    • theoretical error estimates invalid for
      • a complicated parameter space
      • an insufficiently smooth function

Importance Sampling

  • Importance sampling

    • using another density $q(\theta)$, rather than $K(\theta; Y)$, to generate samples, and then using a weighted sample mean to approximate the posterior mean

      $$ \int \mathfrak{g}(\theta) K(\theta; Y)\, d\theta = \int \mathfrak{g}(\theta) \frac{K(\theta; Y)}{q(\theta)} q(\theta)\, d\theta = \int \mathfrak{g}(\theta) w(\theta) q(\theta)\, d\theta \tag{15} $$

      • $q(\theta)$: the importance proposal density $\to \{ \theta \in \Theta: K(\theta; Y) > 0 \} \subseteq \{ \theta \in \Theta: q(\theta) > 0 \}$
      • $w(\theta)$: the importance weight function
    • algorithm to estimate $I = I(Y)$

      1. generate $\theta^{(l)} \stackrel{iid}{\sim} q(\theta)$ for $l = 1, 2, \cdots, N$
      2. compute the importance weights $w^{(l)} = w(\theta^{(l)})$ for $l = 1, 2, \cdots, N$
      3. compute $\overline{I}_N = \frac{1}{N} \sum_{l=1}^N \mathfrak{g}(\theta^{(l)})\, w^{(l)}$
    • estimating the posterior mean

      $$ E[\mathfrak{g}(\theta)|Y] \approx \overline{\mathfrak{g}}_N = \frac{\sum_{l=1}^N \mathfrak{g}(\theta^{(l)})\, w^{(l)}}{\sum_{l=1}^N w^{(l)}} \quad\text{where}\quad \frac{\sum_{l=1}^N w^{(l)}}{N} \xrightarrow{P} \int K(\theta; Y)\, d\theta $$

  • Modeling of importance sampling

    • $\exists\, g$: a probability density w/ known distribution

    • the integral

      $$ I = \int h(x) f(x)\, dx = \int \frac{h(x) f(x)}{g(x)} g(x)\, dx = \mathbb{E}_g (W) $$

      • $W = h(X)f(X)/g(X)$
      • $\mathbb{E}_g(W)$: the expectation w.r.t $g$
    • importance sampling:

      • simulate $X_1, \dots, X_N \sim g$

      • estimate $I$ by the sample average

        $$ \widehat{I} = \frac{1}{N} \sum_{i=1}^N W_i = \frac{1}{N} \sum_{i=1}^N \frac{h(X_i)f(X_i)}{g(X_i)} $$

    • by the law of large numbers: $\widehat{I} \xrightarrow{P} I$
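    A minimal sketch of this estimator, w/ illustrative choices $f = N(0, 1)$, $h(x) = x^2$ (so the true value is $I = 1$), and a Student-$t$ proposal $g$ w/ thicker tails than $f$:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    h = lambda x: x**2            # illustrative integrand; true I = 1 when f = N(0, 1)
    f = stats.norm(0.0, 1.0)      # target density
    g = stats.t(df=3)             # proposal with thicker tails than f

    N = 100_000
    x = g.rvs(size=N, random_state=rng)     # X_1, ..., X_N ~ g
    w = h(x) * f.pdf(x) / g.pdf(x)          # W = h(X) f(X) / g(X)
    I_hat = w.mean()                        # sample average -> I by the LLN

    print(I_hat)                            # close to 1
    ```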

  • Guideline to importance sampling

    • summary: a good choice for an importance sampling density $g$ should be similar to $f$ but w/ thicker tails

    • Theorem: the choice of $g$ that minimizes the variance of $\widehat{I}$ is

      $$ g^*(x) = \frac{|h(x)| f(x)}{\int |h(s)| f(s)\, ds} $$

  • Modeling importance sampling in Bayesian inference

    • $\theta_1, \dots, \theta_N$ as a sample from $g$, the posterior mean

      $$ \mathbb{E}[\theta \,|\, X_1, \dots, X_n] \approx \frac{\frac{1}{N} \sum_{i=1}^N h_1(\theta_i)}{\frac{1}{N} \sum_{i=1}^N h_2(\theta_i)}, \qquad h_1(\theta) = \frac{\theta\, \mathcal{L}_n(\theta)\pi(\theta)}{g(\theta)}, \; h_2(\theta) = \frac{\mathcal{L}_n(\theta)\pi(\theta)}{g(\theta)} $$

    • very difficult to choose a good importance sampler $g$, especially in high dimensions

Monte Carlo Methods

  • Monte Carlo methods

    • a toolkit of techniques aimed at evaluating integrals or sums by simulation rather than exact or approximate algebraic analysis
    • a.k.a. probabilistic sensitivity analysis
    • the simulated quantities passed into a standard spreadsheet and the resulting distributions of the outputs of the spreadsheet reflecting the uncertainty about the inputs
    • used for Bayesian analysis provided the prior/current posterior distribution of concern is a member of a known family
    • conjugate Bayesian analysis: possible to derive such a posterior distribution algebraically
    • used to find tail areas or more usefully to find the distribution of complex functions of one or more unknown quantities as in the probabilistic sensitivity analysis
  • MC statistical methods

    • used in statistics and various fields to solve various problems

    • procedure

      • generating pseudo-random numbers
      • observing that the sample analogues of these numbers converge to their population versions
    • useful for obtaining numerical solutions to problems which are too complicated to solve analytically or by using deterministic methods

    • relying on two celebrated results in Statistics

      • The (Strong or Weak) Law of Large Numbers (LLN)
      • The Central Limit Theorem (CLT)
    • generating $\theta^{(l)} \stackrel{iid}{\sim} p(\theta | Y), \;\; l=1,2,\cdots,N$ by using only $K(\theta; Y)$

    • w/ (Strong/Weak) LLN: as $N \to \infty$

      $$ \overline{\mathfrak{g}}_N = \frac{1}{N} \sum_{l=1}^N \mathfrak{g}(\theta^{(l)}) \xrightarrow{p} E[\mathfrak{g}(\theta) | Y] = \int \mathfrak{g}(\theta) p(\theta | Y)\, d\theta \tag{13} $$

    • $N \nearrow \implies \overline{\mathfrak{g}}_N \to E[\mathfrak{g}(\theta) | Y]$; $\overline{\mathfrak{g}}_N$: sample mean, $E[\mathfrak{g}(\theta) | Y]$: population mean

    • almost no smoothness condition required on the function $\mathfrak{g}(\cdot)$ to apply the MC method

  • Sample size

    • applying the CLT to determine approximately the smallest $N$ needed for a given accuracy

    • as $N \to \infty$

      $$ \sqrt{N} \left(\overline{\mathfrak{g}}_N - E[\mathfrak{g}(\theta) | Y]\right) \sim N\left(0, \Sigma_{\mathfrak{g}}\right) \tag{14} $$

    • $\overline{\mathfrak{g}}_N = E[\mathfrak{g}(\theta) | Y] + O_p(N^{-1/2})$; the (stochastic) error bound does not depend on the dimension $m$ of $\theta$

    • Summary: an accuracy of $\epsilon > 0$ with 95% CI $\implies$

      $$ N \geq \frac{4\hat{\sigma}_{\mathfrak{g}}^2}{\epsilon^2} $$
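    For instance, a purely hypothetical worked case w/ $\hat{\sigma}_{\mathfrak{g}} = 2$ and desired accuracy $\epsilon = 0.05$:

      $$ N \geq \frac{4 \hat{\sigma}_{\mathfrak{g}}^2}{\epsilon^2} = \frac{4 \cdot 4}{0.0025} = 6400 $$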

  • Performance of MC methods

    • error bound of integration methods:
      • MC integration: probabilistic
      • deterministic integration: fixed
    • MC methods more advantageous for
      • non-smooth $\mathfrak{g}(\cdot)$
      • complicated parameter space
  • Numerical method for integral

    • given a function $h$, to evaluate the integral

      $$ I = \int_a^b h(x)\, dx $$

    • numerical techniques for evaluating $I$

      • Simpson's rule
      • the trapezoidal rule
      • Gaussian quadrature
      • Monte Carlo integration
  • Basic Monte Carlo integration method

    • a method for approximating $I$ that is notable for its simplicity, generality and scalability

    • defining $w(x) = h(x) (b-a)$ and $f(x) = 1/(b-a) \ni$

      $$ I = \int_a^b h(x)\, dx = \int_a^b w(x) f(x)\, dx $$

    • generating $X_1, \dots, X_N \sim \text{Uniform}(a, b) \implies$ by the law of large numbers

      $$ \widehat{I} \equiv \frac{1}{N} \sum_{i=1}^N w(X_i) \xrightarrow{P} \mathbb{E}(w(X)) = I $$

    • the standard error of the estimate

      $$ \widehat{se} = \frac{s}{\sqrt{N}}, \qquad s^2 = \frac{\sum_{i=1}^N (Y_i - \widehat{I})^2}{N - 1} \;\;\text{ and }\;\; Y_i = w(X_i) $$

    • region estimate: $(1 - \alpha)$ confidence interval = $\widehat{I} \pm z_{\alpha/2} \cdot \widehat{se}$

    • $N \nearrow \implies$ CI width $\searrow$
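    A small sketch of this estimator together w/ its standard error and confidence interval, using $h(x) = x^3$ on $[0, 2]$ (true value 4) as an illustrative integrand:

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    h = lambda x: x**3                   # illustrative integrand; true integral over [0, 2] is 4
    a, b, N = 0.0, 2.0, 100_000

    x = rng.uniform(a, b, size=N)        # X_i ~ Uniform(a, b)
    y = (b - a) * h(x)                   # Y_i = w(X_i) = h(X_i) (b - a)
    I_hat = y.mean()
    se_hat = y.std(ddof=1) / np.sqrt(N)  # s / sqrt(N)

    ci = (I_hat - 1.96 * se_hat, I_hat + 1.96 * se_hat)   # approximate 95% confidence interval
    print(I_hat, se_hat, ci)
    ```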

  • Issue on basic Monte Carlo method

    • issue: no guarantee $\pi(\theta | \mathcal{D}_n)$ w/ a known distribution
    • solution: importance sampling - a generalization of basic Monte Carlo

Monte Carlo Method for Integral

  • Generalized Monte Carlo integration methods
    • the integral w/ $f(x)$ a probability density function

      $$ I = \int_a^b h(x) f(x)\, dx $$

    • special case: $f \sim \text{Uniform}(a, b)$

    • drawing $X_1, \dots, X_N \sim f$

      $$ \widehat{I} := \frac{1}{N} \sum_{i=1}^N h(X_i) $$

Monte Carlo Method for Gaussian Distribution

  • Numerical methods for Gaussian density
    • the standard normal PDF

      $$ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} $$

    • computing the CDF at some point $x$

      $$ I = \int_{-\infty}^x f(s)\, ds = \Phi(x) \implies I = \int h(s) f(s)\, ds, \quad h(s) = \begin{cases} 1 & s < x \\ 0 & s \geq x \end{cases} $$

    • generating $X_1, \dots, X_N \sim N(0, 1)$ and

      $$ \widehat{I} = \frac{1}{N} \sum_i h(X_i) = \frac{\text{number of observations w/ values} \leq x}{N} $$
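    A brief sketch of this indicator-function estimate at an arbitrary evaluation point $x = 1.645$:

    ```python
    import numpy as np

    rng = np.random.default_rng(7)

    x0 = 1.645                        # arbitrary point at which to estimate Phi(x)
    N = 1_000_000
    z = rng.standard_normal(N)        # X_1, ..., X_N ~ N(0, 1)
    I_hat = np.mean(z <= x0)          # fraction of observations <= x0

    print(I_hat)                      # close to Phi(1.645) ~ 0.95
    ```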

Monte Carlo Method for Two Binomials

  • Bayesian inference for two binomials
    • frequentist analysis
      • $X \sim \text{Binomial}(n, p_1), \;Y \sim \text{Binomial}(m, p_2)$

      • task: to estimate $\delta = p_2 - p_1$

      • the MLE: $\widehat{\delta} = \widehat{p}_2 - \widehat{p}_1 = (Y/m) - (X/n)$

      • the standard error $\widehat{se}$ using the delta method

        $$ \widehat{se} = \sqrt{\frac{\widehat{p}_1 (1 - \widehat{p}_1)}{n} + \frac{\widehat{p}_2 (1 - \widehat{p}_2)}{m}} $$

      • 95% confidence interval: $\widehat{\delta} \pm 2 \cdot \widehat{se}$

    • Bayesian analysis
      • the prior: $\pi(p_1, p_2) = \pi(p_1) \pi(p_2) = 1 \implies$ a flat prior on $(p_1, p_2)$

      • the posterior

        $$ \pi(p_1, p_2 \,|\, X, Y) \propto p_1^X(1 - p_1)^{n - X} p_2^Y (1 - p_2)^{m-Y} $$

      • the posterior mean of $\delta$

        $$ \overline{\delta} = \int_0^1\int_0^1 \delta(p_1, p_2)\, \pi(p_1, p_2 \,|\, X, Y)\,dp_1\, dp_2 = \int_0^1\int_0^1 (p_2 - p_1)\, \pi(p_1, p_2 \,|\, X, Y) \,dp_1\, dp_2 $$

      • obtaining the posterior CDF w/ $A = \{ (p_1, p_2): p_2 - p_1 \leq c \}$

        $$ F(c \,|\, X, Y) = \mathbb{P}(\delta \leq c \,|\, X, Y) = \int_A \pi(p_1, p_2 \,|\, X, Y)\, dp_1\, dp_2 $$

      • then differentiate $F \to$ too complicated

    • simulation approach
      • $\pi(p_1, p_2 \,|\, X, Y) = \pi(p_1 \,|\, X)\, \pi(p_2 \,|\, Y) \implies p_1, p_2$ independent under the posterior distribution

      • simulate $(P_1^{(1)}, P_2^{(1)}), \dots, (P_1^{(N)}, P_2^{(N)})$ from the posterior by drawing ($i = 1, \dots, N$)

        $$\begin{aligned} P_1^{(i)} &\sim \text{Beta}(X + 1, n - X + 1) \\ P_2^{(i)} &\sim \text{Beta}(Y + 1, m - Y + 1) \end{aligned}$$

      • $\delta^{(i)} = P_2^{(i)} - P_1^{(i)} \implies$

        $$ \overline{\delta} \approx \frac{1}{N} \sum_{i=1}^N \delta^{(i)} $$

      • 95% posterior interval for $\delta$:

        • sorting the simulated values
        • finding the .025 and .975 quantiles
      • the posterior density $f(\delta \,|\, X, Y)$

        • applying density estimation techniques to $\delta^{(1)}, \dots, \delta^{(N)}$
        • plotting a histogram
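    A minimal sketch of this simulation w/ hypothetical data $X = 20$ out of $n = 50$ and $Y = 30$ out of $m = 50$:

    ```python
    import numpy as np

    rng = np.random.default_rng(8)

    # hypothetical data
    n, X = 50, 20
    m, Y = 50, 30
    N = 100_000

    # independent Beta posteriors under the flat prior on (p1, p2)
    p1 = rng.beta(X + 1, n - X + 1, size=N)
    p2 = rng.beta(Y + 1, m - Y + 1, size=N)
    delta = p2 - p1                               # delta^(i) = P2^(i) - P1^(i)

    post_mean = delta.mean()                      # approximate posterior mean of delta
    lo, hi = np.quantile(delta, [0.025, 0.975])   # approximate 95% posterior interval

    print(post_mean, (lo, hi))
    ```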

Monte Carlo Method w/ Multiparameters

  • Bayesian inference for multiple parameters
    • task: to estimate LD50, the dose at which the animals have a 50% chance of dying

      $$ \delta = x_{j^*}, \qquad j^* = \min\{ j: p_j \geq 1/2 \} $$

    • $\delta$ implicitly just a complicated function of $p_1, \dots, p_{10} \to \delta = g(p_1, \dots, p_{10})$

    • known $(p_1, \dots, p_{10}) \implies$ find $\delta$

    • the posterior mean of $\delta$ w/ $A = \{ (p_1, \dots, p_{10}): p_1 \leq \cdots \leq p_{10} \}$

      $$ \int\int\cdots\int_A g(p_1, \dots, p_{10})\,\pi(p_1, \dots, p_{10} \,|\, Y_1, \dots, Y_{10}) \,dp_1\, dp_2 \cdots dp_{10} $$

    • the posterior CDF of $\delta$ w/ $B = A \cap \{ (p_1, \dots, p_{10}): g(p_1, \dots, p_{10}) \leq c \}$

      $$\begin{aligned} F(c \,|\, Y_1, \dots, Y_{10}) &= \mathbb{P}(\delta \leq c \,|\, Y_1, \dots, Y_{10}) \\ &= \int\int \cdots\int_B \pi(p_1, \cdots, p_{10} \,|\, Y_1, \dots, Y_{10})\, dp_1\, dp_2 \cdots dp_{10} \end{aligned}$$

    • simulation procedure

      • taking a flat prior truncated over $A$

      • each $P_i \sim$ Beta distribution

      • drawing from the posterior

        • drawing $P_i \sim \text{Beta}(Y_i + 1, \,n - Y_i +1), \; i = 1, \dots, 10$
        • if $P_1 \leq P_2 \leq \dots \leq P_{10}$ keeping this draw. Otherwise, throw it away and draw again until getting one to keep
        • let $\delta = x_{j^\ast}$ w/ $j^\ast = \min \{ j: P_j \geq 1/2 \}$
      • repeat $N$ times to get $\delta^{(1)}, \dots, \delta^{(N)}$ and take

        $$ \mathbb{E}(\delta \,|\, Y_1, \dots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^N \delta^{(i)} $$

      • estimate the probability mass function, since $\delta$ is a discrete variable

        $$ \mathbb{P}(\delta = x_j \,|\, Y_1, \dots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^N I(\delta^{(i)} = x_j) $$
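    A small sketch of this scheme w/ hypothetical doses, group sizes, and death counts; draws violating the monotonicity constraint $P_1 \leq \cdots \leq P_{10}$ are discarded, as described above:

    ```python
    import numpy as np

    rng = np.random.default_rng(9)

    # hypothetical dose-response data: doses, common group size, deaths per dose
    x = np.arange(1, 11, dtype=float)                 # doses x_1, ..., x_10
    n = 15
    Y = np.array([0, 1, 2, 3, 5, 7, 9, 11, 13, 14])   # deaths at each dose

    draws = []
    while len(draws) < 10_000:
        p = rng.beta(Y + 1, n - Y + 1)                # P_j ~ Beta(Y_j + 1, n - Y_j + 1)
        # keep only monotone draws (flat prior truncated to A) with some P_j >= 1/2
        if np.all(np.diff(p) >= 0) and np.any(p >= 0.5):
            j_star = np.argmax(p >= 0.5)              # smallest j with P_j >= 1/2
            draws.append(x[j_star])

    delta = np.array(draws)
    print(delta.mean())                               # approximate E(delta | Y_1, ..., Y_10)
    # probability mass function: P(delta = x_j | data) ~ fraction of draws equal to x_j
    for xj in x:
        print(xj, np.mean(delta == xj))
    ```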

Markov Chain Monte Carlo Methods

  • Non-standard distributions and Bayesian approach

    • non-conjugate distribution or nuisance parameters
      • more complex Bayesian analysis
      • not possible to derive the posterior distribution in an algebraic form
    • Markov chain Monte Carlo methods: developed as a remarkably effective means of sampling from the posterior distribution of interest w/o knowing its algebraic form
  • Essential components of MCMC methods

    • replacing analytic methods by simulation
    • sampling from the posterior distribution
    • starting the simulation
    • checking convergence
  • Markov Chain Monte Carlo (MCMC) methods

    • to resolve the choice of $q(\theta)$, in particular for high-dimensional $\theta$
    • used to generate (dependent) samples from the posterior distribution, only using the posterior kernel $K(\theta; Y)$
    • the posterior distribution: stationary distribution of $\{\theta^{(l)}: l = 1, 2, \dots\}$
    • creating a suitable transition kernel of a Markov Chain
      • the Metropolis algorithm: one of the most popular approaches
      • the Metropolis-Hastings algorithm: generalized version
    • generated samples
      • MCMC methods: dependent samples
      • MC methods: independent samples
    • MCMC samples
      • not directly obtained from the posterior distribution
      • obtained in an asymptotic sense $\to$ discarding the first thousand samples
      • $\{\theta^{(l)}: l = 1, 2, \dots\} \to \{\theta^{(l)}: l = B+1, B+2, \dots\}$, $B > 1$
    • estimating $\eta = \eta(\theta)$ based on samples $\theta^{(l)} \implies$ computing $\overline{\eta} = \sum_{l=B+1}^N \eta^{(l)} / (N-B) \approx E[\eta|Y]$, w/ $N \gg B \gg 1$
  • Markov chain Monte Carlo methods

    • constructing a Markov chain $X_1, X_2, \dots$ w/ $f$ as their stationary distribution

    • problem: estimating the integral $I = \int h(x) f(x) dx$

    • under certain conditions, the integral following

      $$ \frac{1}{N} \sum_{i=1}^N h(X_i) \xrightarrow{P} \mathbb{E}_f(h(X)) = I $$

  • Interpretation of Markov chain

    • a Markov chain w/ the transition kernel $p(y|x)$ = the probability of transiting from $x$ to $y$
    • $f(x) = \int f(y) p(x|y) dy \implies f$ is a stationary distribution for the Markov chain
    • $f$ as a stationary distribution for a Markov chain $\implies$ the data from a sample run of the Markov chain approximating the distribution $f$
    • designing a Markov chain w/ stationary distribution $f \implies$ running the Markov chain and using the resulting data as if it were a sample from $f$

Metropolis-Hastings Algorithm

  • Metropolis-Hastings algorithm
    • algorithm

      choose $X_0$ arbitrary
      Given $X_0, X_1, \dots, X_i$, generate $X_{i+1}$ as following:
      1. generate a proposal or candidate value $Y \sim q(y | X_i)$
      2. Evaluate $r \equiv r(X_i, Y)$ where
      $r(x, y) = \min \{ \frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}, \; 1 \}$
      3. Set
      $X_{i+1} = \begin{cases} Y & \text{with probability } r \\ X_i & \text{with probability } 1 - r \end{cases}$

    • simple way to execute step 3: generate $U \sim \text{Uniform}(0, 1)$; if $U < r$ set $X_{i+1} = Y$, otherwise set $X_{i+1} = X_i$

    • common choice:

      • $Y \,|\, x \sim N(x, b^2), \; b > 0$, i.e., the proposal drawn from a normal centered at the current value

      • the proposal density $q$ symmetric $\implies q(y|x) = q(x|y)$ and

        $$ r = \min\left\{ \frac{f(Y)}{f(X_i)}, \,1 \right\} $$

    • constructed $X_0, X_1, \dots \to$ Markov chain

    • the chain mixing well: the sample from the Markov chain starts to look like the target distribution $f$ quickly

      • the tuning parameter $b$ controls the efficiency and mixing of the chain
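    A minimal sketch of the algorithm w/ a symmetric random-walk proposal $N(X_i, b^2)$; the standard Cauchy target $f$ and the value $b = 1$ are illustrative choices:

    ```python
    import numpy as np

    rng = np.random.default_rng(10)

    f = lambda x: 1.0 / (np.pi * (1.0 + x**2))   # target density (standard Cauchy); known up to a constant is enough
    b = 1.0                                      # proposal scale (tuning parameter)
    N = 50_000

    x = np.empty(N)
    x[0] = 0.0                                   # arbitrary starting value X_0
    accepted = 0
    for i in range(1, N):
        y = rng.normal(x[i - 1], b)              # proposal Y ~ N(X_i, b^2); symmetric, so q cancels
        r = min(f(y) / f(x[i - 1]), 1.0)         # acceptance probability
        if rng.uniform() < r:                    # accept with probability r
            x[i] = y
            accepted += 1
        else:
            x[i] = x[i - 1]

    print(accepted / (N - 1))                    # acceptance rate (rule of thumb: around 50%)
    print(np.median(x[1000:]))                   # summary after discarding burn-in; true median is 0
    ```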

Random-Walk-Metropolis-Hastings Algorithm

  • Random-Walk-Metropolis-Hastings method
    • drawing a proposal $Y$ of the form

      $$ Y = X_i + \epsilon_i \implies q(y|x) = g(y-x) $$

    • the acceptance probability

      $$ r(x, y) = \min \left\{ 1, \frac{f(y)}{f(x)} \right\} $$

    • common choice: $g \sim N(0, b^2)$

    • issue: choosing $b$ to mix well for the Markov chain

    • rule of thumb: choosing $b \ni$ about 50% of the proposals are accepted

Independence-Metropolis-Hastings Algorithm

  • Independence-Metropolis-Hastings method
    • $g$ chosen to be an approximation to $f$

    • the acceptance probability

      $$ r(x, y) = \min \left\{ 1, \frac{f(y)}{f(x)}\, \frac{g(x)}{g(y)} \right\} = \min \left\{ 1, \frac{f(y)}{g(y)}\, \frac{g(x)}{f(x)} \right\} $$

Gibbs Sampling

  • Gibbs sampling

    • a way to turn a high-dimensional problem into several one-dimensional problems
    • a bivariate problem: $(X, Y)$ w/ density $f_{X, Y}(x, y)$
      • the acceptance probability for the Metropolis-Hastings algorithm

        $$ r((X_n, Y_n), (X_n, Y)) = \min \left\{ 1, \frac{f(X_n, Y)}{f(X_n, Y_n)}\, \frac{f_{Y|X}(Y_n|X_n)}{f_{Y|X}(Y|X_n)} \right\} $$

      • iterating until convergence

        $$\begin{aligned} X_{n+1} &\sim f_{X|Y}(x \,|\, Y_n) \\ Y_{n+1} &\sim f_{Y|X}(y \,|\, X_{n+1}) \end{aligned}$$

  • Normal hierarchical model w/ Gibbs sampling

    • making some transformations to use some normal approximations

      • $\widehat{p}_i = Y_i / n_i \approx N(p_i, s_i), \quad s_i = \sqrt{\widehat{p}_i(1 - \widehat{p}_i)/n_i} $
      • $\psi_i = \log\left(p_i/(1-p_i)\right), \quad Z_i \equiv \widehat{\psi}_i = \log(\widehat{p}_i / (1 - \widehat{p}_i))$
    • by delta method

      $$ \widehat{\psi}_i \approx N(\psi_i, \sigma_i^2), \qquad \sigma_i^2 = \frac{1}{n_i\,\widehat{p}_i (1 - \widehat{p}_i)} $$

    • the normal approximation for $\psi$ more accurate than the normal approximation for $p$

    • the hierarchical model

      $$ \psi_i \sim N(\mu, \tau^2), \qquad Z_i \,|\, \psi_i \sim N(\psi_i, \,\sigma_i^2) $$

    • the likelihood function

      $$ \mathcal{L}_n(\theta) \propto \prod_i \exp\left( -\frac{1}{2}(\psi_i - \mu)^2 \right) \exp\left( -\frac{1}{2\sigma_i^2}(Z_i - \psi_i)^2 \right) $$

    • the prior $f(\mu) \propto 1 \implies$ the posterior proportional to the likelihood

    • using Gibbs sampling requires the conditional distribution of each parameter given all the others, e.g. $f(\mu \,|\, \text{rest})$

      • ignoring the terms not related to $\mu$: $\mu \,|\, \text{rest} \sim N(b, \,1/k)$
    • finding $f(\psi \,|\, \text{rest})$ by ignoring any terms irrelevant to $\psi$: $\psi_i \,|\, \text{rest} \sim N(e_i, \,d_i^2)$

    • the Gibbs sampling algorithm involving iterating the following steps $N$ times

      $$\begin{array}{rcl} \text{draw } \mu & \sim & N(b, \,v^2)\\ \text{draw } \psi_1 & \sim & N(e_1, \,d_1^2) \\ & \vdots & \\ \text{draw } \psi_k & \sim & N(e_k, \,d_k^2) \end{array}$$

    • at each step, the most recently drawn version of each variable used
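    A small sketch of this Gibbs sampler, assuming unit prior variance for the $\psi_i$ (consistent w/ the likelihood above) and a flat prior on $\mu$; under those assumptions the conditionals reduce to $\mu \,|\, \text{rest} \sim N(\bar{\psi}, 1/k)$ and $\psi_i \,|\, \text{rest} \sim N(e_i, d_i^2)$ w/ $d_i^2 = (1 + 1/\sigma_i^2)^{-1}$ and $e_i = d_i^2(\mu + Z_i/\sigma_i^2)$; the data are hypothetical:

    ```python
    import numpy as np

    rng = np.random.default_rng(11)

    # hypothetical binomial data for k groups
    Y = np.array([ 5,  8, 12,  7, 16,  9])
    n = np.array([40, 40, 40, 40, 40, 40])
    k = len(Y)

    # transform to the approximate normal model: Z_i ~ N(psi_i, sigma_i^2)
    p_hat = Y / n
    Z = np.log(p_hat / (1 - p_hat))
    sigma2 = 1.0 / (n * p_hat * (1 - p_hat))

    N = 20_000
    mu_draws = np.empty(N)
    psi = Z.copy()                                   # starting values for psi_1, ..., psi_k

    for t in range(N):
        # draw mu | rest ~ N(mean(psi), 1/k), under the assumptions stated above
        mu = rng.normal(psi.mean(), np.sqrt(1.0 / k))
        # draw psi_i | rest ~ N(e_i, d_i^2), using the most recent mu
        d2 = 1.0 / (1.0 + 1.0 / sigma2)
        e = d2 * (mu + Z / sigma2)
        psi = rng.normal(e, np.sqrt(d2))
        mu_draws[t] = mu

    burn = 1000
    print(mu_draws[burn:].mean())                    # posterior mean of mu after burn-in
    ```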

  • Gibbs sampling w/ Metropolis-Hastings

    • not knowing how to draw samples $\to$ using the Gibbs sampling algorithm by drawing each observation using a Metropolis-Hastings step

    • algorithm

      (1a) draw a proposal $Z \sim q(z \,|\, X_n)$
      (1b) Evaluate

      $$ r = \min\left\{ \frac{f(Z, Y_n)}{f(X_n, Y_n)}\, \frac{q(X_n | Z)}{q(Z | X_n)}, \;1 \right\} $$

      (1c) Set

      $$ X_{n+1} = \begin{cases} Z & \text{with probability } r \\ X_n & \text{with probability } 1- r \end{cases} $$

      (2a) Draw a proposal $Z \sim \tilde{q}(z \,|\, Y_n)$
      (2b) Evaluate

      $$ r = \min \left\{ \frac{f(X_{n+1}, Z)}{f(X_{n+1}, Y_n)}\, \frac{\tilde{q}(Y_n | Z)}{\tilde{q}(Z | Y_n)}, \; 1 \right\} $$

      (2c) Set

      $$ Y_{n+1} = \begin{cases} Z & \text{with probability } r \\ Y_n & \text{with probability } 1 - r \end{cases} $$

    • in step (1) (and similarly in step (2)), w/ $Y_n$ fixed, sampling from $f(Z | Y_n)$ is equivalent to sampling from $f(Z, Y_n)$, as the ratios are identical

      $$ \frac{f(Z, Y_n)}{f(X_n, Y_n)} = \frac{f(Z | Y_n)}{f(X_n | Y_n)} $$