Conjugate Priors

Overview of Conjugate Priors

  • Prior conjugate family

    • conjugate family: prior densities whose resulting posterior densities belong to the same family
    • the choice of conjugate family is not unique
    • exponential family of densities: a sampling density w/ a sufficient statistic of constant dimension always admits a conjugate family of prior densities
    • natural conjugate family: prior densities w/ the same functional form as the likelihood
    • subjective prior or informative prior: the parameters of the prior density are elicited from previously collected data or expert knowledge
    • noninformative prior: used when no prior information is available, or very little is known about the parameter $\theta$
  • Conjugate prior

    • prior: $\theta \sim N(a, b^2)$

    • the posterior for $\theta$

      $$ \theta \,|\, \mathcal{D}_n \sim N(\overline{\theta}, \tau^2) \tag{4} $$

    • region estimate: finding a posterior interval means finding $C = (c, d)$ s.t. $\mathbb{P}(\theta \in C \,|\, \mathcal{D}_n) = 0.95$

    • $\exists\; c$ and $d$ s.t. $\mathbb{P}(\theta < c \,|\, \mathcal{D}_n) = 0.025$ and $\mathbb{P}(\theta > d \,|\, \mathcal{D}_n) = 0.025$

    • find $c$ s.t.

      $$ \mathbb{P}(\theta < c \,|\, \mathcal{D}_n) = \mathbb{P} \left( \frac{\theta - \overline{\theta}}{\tau} < \frac{c - \overline{\theta}}{\tau}\right) = \mathbb{P}\left( Z < \frac{c - \overline{\theta}}{\tau} \right) = 0.025 $$

      i.e. $c = \overline{\theta} + \tau\, \Phi^{-1}(0.025) = \overline{\theta} - 1.96\, \tau$, and symmetrically $d = \overline{\theta} + 1.96\, \tau$ (a numeric sketch follows)
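    • a minimal numeric sketch of this interval computation, assuming scipy is available (the posterior parameters below are made-up placeholders, not values derived in the text):

      ```python
      from scipy.stats import norm

      # hypothetical posterior parameters for theta | D_n ~ N(theta_bar, tau^2)
      theta_bar, tau = 2.1, 0.4

      # equal-tailed 95% posterior interval: 2.5% probability mass in each tail
      c = norm.ppf(0.025, loc=theta_bar, scale=tau)
      d = norm.ppf(0.975, loc=theta_bar, scale=tau)
      print(f"95% posterior interval: ({c:.3f}, {d:.3f})")
      ```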

  • Definition of conjugate priors

    • a family of prior distributions closed under sampling

    • $\mathcal{P}$: a family of prior distributions

    • Definition. $\forall\; \theta$, let $p(\cdot \,|\, \theta) \in \mathcal{F}$ be a sampling distribution over a sample space $\mathcal{X}$. If the posterior

      $$ p(\theta \,|\, \mathbf{x}) = \frac{p(\mathbf{x} \,|\, \theta)\, \pi(\theta)}{\int p(\mathbf{x} \,|\, \theta)\, \pi(\theta)\, d\theta} $$

      satisfies $p(\cdot \,|\, \mathbf{x}) \in \mathcal{P}$ for every prior $\pi \in \mathcal{P}$, then the family $\mathcal{P}$ is conjugate to the family of sampling distributions $\mathcal{F}$

    • the family $\mathcal{P}$ should be sufficiently restricted, and is typically taken to be a specific parametric family.

Conjugate Priors with Exponential Family

  • General exponential family models

    • $p(\cdot \,|\, \theta)$: a standard exponential family model

    • the density w.r.t. a positive measure $\mu$

      $$ p(\mathbf{x} \,|\, \pmb{\theta}) = \exp\left(\pmb{\theta}^T \mathbf{x} - A(\pmb{\theta})\right) \tag{5} $$

      • $A(\pmb{\theta})$: the moment generating or log-normalizing function

        $$ A(\pmb{\theta}) = \log\left(\int \exp(\pmb{\theta}^T \mathbf{x})\, d\mu(\mathbf{x}) \right) $$

    • the density of a conjugate prior for the exponential family

      $$ \pi_{\mathbf{x}_0,n_0}(\pmb{\theta}) = \frac{\exp(n_0 \mathbf{x}_0^T \pmb{\theta} - n_0 A(\pmb{\theta}))}{\int \exp(n_0 \mathbf{x}_0^T \pmb{\theta} - n_0 A(\pmb{\theta}))\, d\pmb{\theta}} $$

    • the posterior

      $$\begin{align*} p(\mathbf{x} \,|\, \pmb{\theta})\, \pi_{\mathbf{x}_0,n_0}(\pmb{\theta}) &= \exp(\pmb{\theta}^T \mathbf{x} - A(\pmb{\theta})) \exp\left(n_0\mathbf{x}_0^T\pmb{\theta} - n_0\, A(\pmb{\theta})\right) \\ &\propto \pi_{\frac{\mathbf{x}}{1+n_0}+\frac{n_0\mathbf{x}_0}{1+n_0},\, 1+n_0}(\pmb{\theta}) \end{align*}$$

    • the prior incorporating $n_0$ "virtual" observations of $\mathbf{x}_0 \in \mathbb{R}^d$

    • after making one "real" observation $\mathbf{x}$: the parameters of the posterior are a mixture of the virtual and the actual observations

      $$ n_0^\prime = 1 + n_0 \quad \text{ and } \quad \mathbf{x}_0^\prime = \frac{\mathbf{x}}{1 + n_0} + \frac{n_0 \mathbf{x}_0}{1 + n_0} $$
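    • a minimal sketch of this one-observation update rule in plain Python (the numbers are illustrative only):

      ```python
      def conjugate_update(x0, n0, x):
          """Hyperparameter update (n0, x0) -> (n0', x0') after one real
          observation x: the posterior mixes the n0 virtual observations
          at x0 with the actual observation."""
          x0_new = x / (1 + n0) + n0 * x0 / (1 + n0)
          return x0_new, 1 + n0

      # e.g. n0 = 4 virtual observations at x0 = 0.5, then one real x = 1.0
      print(conjugate_update(x0=0.5, n0=4, x=1.0))  # -> (0.6, 5)
      ```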

  • Generalized exponential family model

    • $n$ observations $\mathbf{X}_1, \dots, \mathbf{X}_n \implies$ the posterior

      $$ p(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) \propto \exp \left( (n + n_0) \left( \frac{n\overline{\mathbf{X}}}{n + n_0} + \frac{n_0\mathbf{x}_0}{n+n_0} \right)^T \pmb{\theta} - (n + n_0)A(\pmb{\theta}) \right) $$

    • the parameters of the posterior

      $$ n_0^\prime = n + n_0 \quad \text{ and } \quad \mathbf{x}_0^\prime = \frac{n \overline{\mathbf{X}}}{n+n_0} + \frac{n_0\mathbf{x}_0}{n+n_0} $$

    • the expectation w.r.t. $\pi_{\mathbf{x}_0, n_0}$

      $$ \mathbb{E}[\nabla A(\pmb{\theta})] = \int \nabla A(\pmb{\theta})\, \pi_{\mathbf{x}_0, n_0}(\pmb{\theta})\, d\pmb{\theta} = \mathbf{x}_0 - \frac{1}{n_0} \int \nabla \pi_{\mathbf{x}_0, n_0}(\pmb{\theta})\, d\pmb{\theta} = \mathbf{x}_0 $$

    • more generally,

      $$ \mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n] = \frac{n\overline{\mathbf{X}}}{n_0+n} + \frac{n_0 \mathbf{x}_0}{n_0+n} $$

    • under appropriate regularity conditions, the converse also holds, so that linearity of $\mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n]$ in $\overline{\mathbf{X}}$ is sufficient for conjugacy

  • Theorem (exponential family)

    • an open parameter space $\Theta \subset \mathbb{R}^d$

    • $\mathbf{X}$: a sample of size one from the exponential family $p(\cdot \,|\, \pmb{\theta})$

    • the support of $\mu$ containing an open interval

    • $\pi(\pmb{\theta})$: a prior density not concentrated at a single point

    • the posterior mean of $\nabla A(\pmb{\theta})$ given a single observation $\mathbf{X}$ is linear iff the prior is conjugate:

      $$ \mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}] = a\mathbf{X} + \mathbf{b} \quad\iff\quad \pi(\pmb{\theta}) \propto \exp\left( \frac{1}{a} \mathbf{b}^T \pmb{\theta} - \frac{1-a}{a} A(\pmb{\theta}) \right) $$

    • a similar result holds w/ a discrete measure $\mu$

  • Conjugate priors for discrete exponential family distributions

    | Sample Space | Sampling Dist. | Conjugate Prior | Posterior |
    | --- | --- | --- | --- |
    | $X = \{0, 1\}$ | $\text{Bernoulli}(\theta)$ | $\text{Beta}(\alpha, \beta)$ | $\text{Beta}\left(\alpha + n\overline{X}, \beta + n\left(1-\overline{X}\right)\right)$ |
    | $X = \mathbb{Z}_+$ | $\text{Poisson}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha + n\overline{X}, \beta + n\right)$ |
    | $X = \mathbb{Z}_{++}$ | $\text{Geometric}(\theta)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha+n, \beta+n\overline{X}\right)$ |
    | $X = \mathbb{H}_K$ | $\text{Multinomial}(\pmb{\theta})$ | $\text{Dirichlet}(\pmb{\alpha})$ | $\text{Dirichlet}\left(\pmb{\alpha}+n\overline{\mathbf{X}}\right)$ |
  • Conjugate priors for some continuous distributions

    | Sampling Dist. | Conjugate Prior | Posterior |
    | --- | --- | --- |
    | $\text{Uniform}(\theta)$ | $\text{Pareto}(\nu_0, k)$ | $\text{Pareto}\left(\max\{\nu_0, X_{(n)}\}, n+k\right)$ |
    | $\text{Exponential}(\theta)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha+n, \beta+n\overline{X}\right)$ |
    | $N(\mu, \sigma^2)$, known $\sigma^2$ | $N(\mu_0, \sigma_0^2)$ | $N\left(\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\overline{X}}{\sigma^2}\right), \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$ |
    | $N(\mu, \sigma^2)$, known $\mu$ | $\text{InvGamma}(\alpha, \beta)$ | $\text{InvGamma}\left(\alpha+\frac{n}{2}, \beta + \frac{n}{2}\, \overline{(X - \mu)^2}\right)$ |
    | $N(\mu, \sigma^2)$, known $\mu$ | $\text{ScaledInv-}\chi^2(\nu_0, \sigma_0^2)$ | $\text{ScaledInv-}\chi^2\left(\nu_0+n, \frac{\nu_0\sigma_0^2}{\nu_0 + n} + \frac{n\,\overline{(X-\mu)^{2}}}{\nu_0 + n}\right)$ |
    | $N(\pmb{\mu}, \pmb{\Sigma})$, known $\pmb{\Sigma}$ | $N(\pmb{\mu}_0, \pmb{\Sigma}_0)$ | $N\left(\mathbf{K}\left(\pmb{\Sigma}_0^{-1} \pmb{\mu}_0 + n \pmb{\Sigma}^{-1} \overline{\mathbf{X}}\right), \mathbf{K}\right)$ with $\mathbf{K} = (\pmb{\Sigma}_0^{-1} + n\pmb{\Sigma}^{-1})^{-1}$ |
    | $N(\pmb{\mu}, \pmb{\Sigma})$, known $\pmb{\mu}$ | $\text{InvWishart}(\nu_0, \mathbf{S}_0)$ | $\text{InvWishart}(\nu_0+n, \mathbf{S}_0+n \overline{\mathbf{S}})$, $\overline{\mathbf{S}}$ the sample covariance |
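  • the closed forms above are easy to sanity-check numerically; a sketch for the known-$\sigma^2$ Gaussian row, comparing the closed-form posterior against a brute-force grid normalization of prior $\times$ likelihood (numpy/scipy assumed, simulated placeholder data):

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sigma, mu0, sigma0 = 2.0, 0.0, 3.0   # known sigma; N(mu0, sigma0^2) prior
    X = rng.normal(1.5, sigma, size=50)  # simulated data

    # closed-form posterior mean and variance from the table row
    prec = 1 / sigma0**2 + len(X) / sigma**2
    mean_cf = (mu0 / sigma0**2 + len(X) * X.mean() / sigma**2) / prec
    var_cf = 1 / prec

    # brute force: normalize prior * likelihood over a fine grid of mu values
    grid = np.linspace(-5, 5, 20001)
    logp = norm.logpdf(grid, mu0, sigma0) + norm.logpdf(X[:, None], grid, sigma).sum(axis=0)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    print(mean_cf, (grid * w).sum())                  # means agree closely
    print(var_cf, ((grid - mean_cf) ** 2 * w).sum())  # variances agree closely
    ```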

Uniform-Bernoulli likelihood model

  • Uniform-Bernoulli likelihood model

    • sampling distribution: $\exists\; \mathcal{D}_n = \{X_1, \dots, X_n\}$, $\;X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$

    • prior distribution: the uniform distribution, $\pi(\theta) = 1$ for $\theta \in [0, 1]$

    • $S_n = \sum_{i=1}^n X_i$: the number of successes

    • the posterior distribution

      $$\begin{align*} p(\theta \,|\, \mathcal{D}_n) & \propto \pi(\theta)\, \mathcal{L}_n(\theta) = \theta^{S_n} (1-\theta)^{n - S_n} = \theta^{S_n+1-1} (1-\theta)^{n - S_n +1 -1} \\ &= \frac{\Gamma(n+2)}{\Gamma(S_n + 1)\, \Gamma(n-S_n+1)}\, \theta^{(S_n+1)-1} (1-\theta)^{(n-S_n+1)-1} \\ \therefore\; \theta \,|\, \mathcal{D}_n &\sim \text{Beta}(S_n+1,\, n-S_n +1) \end{align*}$$

    • the Bayesian posterior point estimator

      $$ \overline{\theta} = \frac{S_n + 1}{n+2} = \lambda_n \hat{\theta} + (1 - \lambda_n) \tilde{\theta} $$

      • $\hat{\theta} = S_n / n$: the maximum likelihood estimate
      • $\tilde{\theta} = 1/2$: the prior mean
      • $\lambda_n = n/(n+2) \approx 1$ for large $n$
    • the Bayesian posterior credible interval: a 95% posterior interval $(a, b)$ satisfies $\int_a^b p(\theta \,|\, \mathcal{D}_n)\, d\theta = 0.95$
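    • a short sketch of the point estimate and interval, assuming scipy ($n$ and $S_n$ below are made-up):

      ```python
      from scipy.stats import beta

      n, S = 20, 14                            # hypothetical trials and successes

      # flat prior => theta | D_n ~ Beta(S_n + 1, n - S_n + 1)
      a, b = S + 1, n - S + 1

      theta_bar = a / (a + b)                  # posterior mean (S_n + 1) / (n + 2)
      lo, hi = beta.ppf([0.025, 0.975], a, b)  # equal-tailed 95% posterior interval
      print(theta_bar, (lo, hi))
      ```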

  • Flat priors not invariant

    • contradiction
      • the notion of a flat prior is not well defined
      • a flat prior on a parameter $\nRightarrow$ a flat prior on a transformed version of that parameter
    • flat priors are not transformation invariant (see the example below)
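    • example: take the flat prior $\pi_\theta(\theta) = 1$ on $(0, 1)$ and the log-odds transform $\varphi = \log(\theta/(1-\theta))$; the change-of-variables formula gives

      $$ \pi_\varphi(\varphi) = \pi_\theta\big(\theta(\varphi)\big) \left|\frac{d\theta}{d\varphi}\right| = \frac{e^\varphi}{(1+e^\varphi)^2} $$

      which is not flat in $\varphi$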

Beta-Bernoulli likelihood model

  • Beta-Bernoulli likelihood model
    • sampling distribution: $\exists\; \mathcal{D}_n = \{X_1, \dots, X_n\}$, $\;X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$ w/ $\hat{\theta} = S_n/n$

    • the prior distribution: $\theta \sim \text{Beta}(\alpha, \beta)$ w/ prior mean $\theta_0 = \alpha/(\alpha+\beta)$

    • the posterior distribution: $\theta \,|\, \mathcal{D}_n \sim \text{Beta}(\alpha + S_n, \beta + n - S_n)$

    • the flat (uniform) prior: $\alpha = \beta = 1$

    • the posterior mean:

      $$ \overline{\theta} = \frac{\alpha + S_n}{\alpha + \beta + n} = \left(\frac{n}{\alpha+\beta+n}\right) \hat{\theta} + \left(\frac{\alpha+\beta}{\alpha+\beta+n} \right) \theta_0 $$
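    • a brief plain-Python sketch of this shrinkage decomposition ($\alpha$, $\beta$, and the data are placeholders):

      ```python
      alpha, beta_, n, S = 3.0, 7.0, 20, 14  # hypothetical prior and data

      mle = S / n                            # theta_hat
      prior_mean = alpha / (alpha + beta_)   # theta_0
      w = n / (alpha + beta_ + n)            # weight on the data, -> 1 as n grows

      post_mean = (alpha + S) / (alpha + beta_ + n)
      print(post_mean, w * mle + (1 - w) * prior_mean)  # identical values
      ```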

Dirichlet-Multinomial likelihood model

  • Dirichlet-Multinomial likelihood model
    • sampling distribution: $\exists\; \mathcal{D}_n = \{\mathbf{X}_1, \dots, \mathbf{X}_n\}$, $\;\mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{Multinomial}(\pmb{\theta})$

    • prior distribution: Dirichlet prior

    • the sample space of the multinomial w/ $K$ outcomes is the set of vertices of the $K$-dim hypercube $\mathbb{H}_K$, made up of vectors w/ exactly one 1 and the remaining elements 0

      $$ \mathbf{x} = \underbrace{(0, 0, \dots, 0, 1, 0, \dots, 0)^T}_{K\text{ places}} $$

    • $\exists\; \mathbf{X}_i = (X_{i1}, \dots, X_{iK})^T \in \mathbb{H}_K$,

      $$ \underbrace{\pmb{\theta} \sim \text{Dirichlet}(\pmb{\alpha})}_{\text{prior}} \;\text{ and }\; \underbrace{\mathbf{X}_i \,|\, \pmb{\theta} \sim \text{Multinomial}(\pmb{\theta})}_{\text{likelihood}} \quad \forall\; i=1, 2, \dots, n $$

      $\implies$ the posterior satisfies

      $$ p(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) \propto \mathcal{L}_n(\pmb{\theta})\, \pi(\pmb{\theta}) \propto \prod_{i=1}^n \prod_{j=1}^K \theta_j^{X_{ij}} \prod_{j=1}^K \theta_j^{\alpha_j - 1} = \prod_{j=1}^K \theta_j^{\sum_{i=1}^n X_{ij}+\alpha_j-1} $$

    • the posterior distribution w/ $\overline{\mathbf{X}} = \sum_{i=1}^n \mathbf{X}_i / n \in \Delta_K$

      $$ \pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{Dirichlet}(\pmb{\alpha} + n \overline{\mathbf{X}}) $$

    • the posterior mean

      $$ \mathbb{E}(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) = \left(\frac{\alpha_1 + \sum_{i=1}^n X_{i1}}{\sum_{i=1}^K \alpha_i + n}, \dots, \frac{\alpha_K + \sum_{i=1}^n X_{iK}}{\sum_{i=1}^K \alpha_i + n} \right)^T $$

    • prior conjugate w.r.t. the model: the prior is a Dirichlet distribution $\to$ the posterior is a Dirichlet distribution
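    • a minimal numpy sketch of the Dirichlet update (the category counts are made-up):

      ```python
      import numpy as np

      counts = np.array([12, 5, 3])  # sum_i X_ij over n = 20 one-hot draws
      alpha = np.ones(3)             # flat Dirichlet prior

      alpha_post = alpha + counts    # Dirichlet(alpha + n * X_bar)
      print(alpha_post, alpha_post / alpha_post.sum())  # params and posterior mean
      ```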

Gamma-Poisson likelihood model

  • Gamma-Poisson likelihood model
    • Poisson model w/ rate $\lambda \geq 0$ in the sample space $\mathcal{X} = \mathbb{Z}_+ \text{ s.t. }$

      $$ \mathbb{P}(X = x \,|\, \lambda) = \frac{\lambda^x}{x!} e^{-\lambda} \propto \exp(x\log\lambda - \lambda) $$

    • the natural parameter: $\theta = \log\lambda$

    • the conjugate prior

      $$ \pi_{x_0, n_0}(\lambda) \propto \exp(n_0 x_0 \log \lambda - n_0\lambda) $$

    • a better parameterization of the prior as the $\text{Gamma}(\alpha, \beta)$

      $$ \pi_{\alpha, \beta}(\lambda) \propto \lambda^{\alpha-1} e^{-\beta\lambda} $$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observations from $\text{Poisson}(\lambda)$

    • the posterior

      $$ \lambda \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n\overline{X},\, \beta+n) $$

    • the prior acts as if $\beta$ virtual observations were made, with a total count of $\alpha -1$ among them
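    • a brief sketch of the Gamma-Poisson update, assuming scipy (simulated placeholder data):

      ```python
      import numpy as np
      from scipy.stats import gamma

      rng = np.random.default_rng(1)
      X = rng.poisson(3.0, size=40)  # simulated Poisson counts
      alpha, beta_ = 2.0, 1.0        # hypothetical Gamma(alpha, beta) prior

      # posterior Gamma(alpha + n * X_bar, beta + n); scipy uses scale = 1/rate
      post = gamma(alpha + X.sum(), scale=1 / (beta_ + len(X)))
      print(post.mean(), post.interval(0.95))
      ```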

Gamma-Exponential likelihood model

  • Gamma-Exponential likelihood model
    • exponential distribution w/ the sample space $\mathcal{X} = \mathbb{R}_+$ s.t.

      $$ p(x \,|\, \theta) = \theta e^{-x\theta} $$

    • the exponential model is widely used for survival times or waiting times between events

    • the conjugate prior: Gamma distribution in the most convenient parameterization

      $$ \pi_{\alpha, \beta}(\theta) \propto \theta^{\alpha - 1} e^{-\beta\theta} $$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Exponential}(\theta)$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n,\, \beta + n\overline{X}) $$

    • the prior acts as if $\alpha - 1$ virtual examples were used, w/ a total waiting time of $\beta$

Gamma-Geometric likelihood model

  • Gamma-Geometric likelihood model
    • the geometric distribution

      • the discrete analogue of the exponential model
      • sample space $\mathcal{X} = \mathbb{Z}_{++}$, the strictly positive integers
      • the density

      $$ \mathbb{P}(X = x \,|\, \theta) = (1-\theta)^{x-1} \theta $$

    • the conjugate prior: $\text{Gamma}(\alpha, \beta)$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Geometric}(\theta)$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n,\, \beta + n\overline{X}) $$

InvGamma-Gaussian likelihood model

  • InvGamma-Gaussian likelihood model
    • sampling distribution: $N(\mu, \sigma^2)$ w/ $\mu$ known

    • the likelihood function

      $$\begin{align*} p(X_1, \dots, X_n \,|\, \sigma^2) &\propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \right) \\ &= (\sigma^2)^{-n/2} \exp\left( -\frac{n}{2\sigma^2}\, \overline{(X - \mu)^2} \right), \quad \text{with } \overline{(X-\mu)^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \end{align*}$$

    • the conjugate prior

      • inverse Gamma distribution: $1/\theta \sim \text{Gamma}(\alpha, \beta)$

      • the density

        $$ \pi_{\alpha, \beta}(\theta) \propto \theta^{-(\alpha+1)} e^{-\beta/\theta} $$

    • the posterior distribution of $\sigma^2$

      $$ \sigma^2 \,|\, X_1, \dots, X_n \sim \text{InvGamma}\left(\alpha + \frac{n}{2},\, \beta + \frac{n}{2}\, \overline{(X - \mu)^2}\right) $$
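    • a sketch of this update w/ scipy's invgamma, whose shape/scale parameterization matches $\pi_{\alpha, \beta}$ above ($\mu$, the prior values, and the data are placeholders):

      ```python
      import numpy as np
      from scipy.stats import invgamma

      rng = np.random.default_rng(2)
      mu = 0.0                          # known mean
      X = rng.normal(mu, 2.0, size=30)  # simulated data, true sigma^2 = 4
      alpha, beta_ = 3.0, 2.0           # hypothetical InvGamma(alpha, beta) prior

      a_post = alpha + len(X) / 2
      b_post = beta_ + 0.5 * np.sum((X - mu) ** 2)
      post = invgamma(a_post, scale=b_post)  # pdf ~ x^{-(a+1)} exp(-scale/x)
      print(post.mean(), post.interval(0.95))
      ```

      the same code covers the scaled inverse-$\chi^2$ prior of the next section, since $\text{ScaledInv-}\chi^2(\nu_0, \sigma_0^2)$ coincides with $\text{InvGamma}(\nu_0/2,\, \nu_0\sigma_0^2/2)$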

ScaledInv-$\chi^2$-Gaussian likelihood model

  • ScaledInv-$\chi^2$-Gaussian likelihood model
    • the prior: the scaled inverse $\chi^2$ distribution, i.e. the distribution of $\nu_0\sigma_0^2 / Z$ w/ $Z \sim \chi_{\nu_0}^2$

      $$ \pi_{\nu_0, \sigma_0^2}(\theta) \propto \theta^{-(1+\nu_0/2)} \exp\left( -\frac{\nu_0 \sigma^2_0}{2\theta} \right) $$

    • the posterior

      $$ \sigma^2 \,|\, X_1, \dots, X_n \sim \text{ScaledInv-}\chi^2 \left( \nu_0 + n,\, \frac{\nu_0 \sigma_0^2}{\nu_0 + n} + \frac{n\, \overline{(X - \mu)^2}}{\nu_0 + n} \right) $$

InvWishart-Gaussian likelihood model

  • InvWishart-Gaussian likelihood model
    • sampling distribution: $\exists\; \mathbf{X}_1, \dots, \mathbf{X}_n$ observed data from $N(\mathbf{0}, \pmb{\Sigma})$, w/ covariance $\pmb{\Sigma} \in \mathbb{R}^{d\times d}$ (a positive semi-definite matrix)

    • the posterior: multiply the likelihood by an inverse Wishart prior

      $$\begin{align*} &p(\mathbf{X}_1, \dots, \mathbf{X}_n \,|\, \pmb{\Sigma})\, \pi_{\nu_0, \mathbf{S}_0}(\pmb{\Sigma}) \\ & \hspace{3em}\propto |\pmb{\Sigma}|^{-n/2} \exp \left( -\frac{n}{2} \text{tr}(\overline{\mathbf{S}}\pmb{\Sigma}^{-1}) \right) |\pmb{\Sigma}|^{-(\nu_0+d+1)/2} \exp \left( -\frac{1}{2}\text{tr}(\mathbf{S}_0 \pmb{\Sigma}^{-1}) \right) \\ &\hspace{1em}= |\pmb{\Sigma}|^{-(n+\nu_0+d+1)/2} \exp \left( -\frac{1}{2} \text{tr}\left(\left(n \overline{\mathbf{S}} + \mathbf{S}_0\right) \pmb{\Sigma}^{-1}\right) \right) \end{align*}$$

      • the empirical covariance: $\overline{\mathbf{S}} = \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i^T$
    • the posterior

      $$ \pmb{\Sigma} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{InvWishart}(\nu_0 + n,\, \mathbf{S}_0 + n\overline{\mathbf{S}}) $$

    • similarly, the conjugate prior for the inverse covariance $\pmb{\Sigma}^{-1}$ (precision matrix) is a Wishart
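    • a sketch w/ scipy's invwishart (dimension, prior, and data are all made-up):

      ```python
      import numpy as np
      from scipy.stats import invwishart

      rng = np.random.default_rng(3)
      d, n = 3, 100
      X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # mean-zero data

      nu0, S0 = d + 2, np.eye(d)   # hypothetical InvWishart(nu0, S0) prior
      S_bar = (X.T @ X) / n        # empirical covariance for the mu = 0 model

      post = invwishart(df=nu0 + n, scale=S0 + n * S_bar)
      print(np.round(post.mean(), 2))  # posterior mean of Sigma, near the identity
      ```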

Pareto-Uniform likelihood model

  • Pareto-Uniform likelihood model
    • uniform distribution: $\text{Uniform}(0, \theta)$, $\theta \geq 0$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Uniform}(0, \theta)$

    • the prior of $\theta$: $\text{Pareto}(k, \nu_0)$

    • let $X_{(n)} = \max_{1 \leq i \leq n} \{ X_i \}$

      • $\theta < \max\{\nu_0, X_{(n)}\} \implies$ either the likelihood or the prior vanishes, so

        $$ \mathcal{L}_n(\theta)\, \pi_{k, \nu_0}(\theta) = 0 $$

      • $\theta \geq \max\{\nu_0, X_{(n)}\} \implies$ the posterior satisfies ($\theta$ must be at least $X_{(n)}$ and $\nu_0$)

        $$ \mathcal{L}_n(\theta)\, \pi_{k, \nu_0}(\theta) \propto \frac{1}{\theta^n} \frac{1}{\theta^{k+1}} $$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Pareto}\left(n + k,\, \max\{X_{(n)}, \nu_0\}\right) $$

    • $n \nearrow \;\to\;$ the decay of the posterior $\nearrow \implies$ a more peaked distribution around $X_{(n)}$

    • the parameter $k$ controls the sharpness of the decay for small $n$
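    • a final sketch of the Pareto-Uniform update, assuming scipy (whose pareto takes a shape $b$ and a scale equal to the lower endpoint; data simulated):

      ```python
      import numpy as np
      from scipy.stats import pareto

      rng = np.random.default_rng(4)
      X = rng.uniform(0, 5.0, size=25)  # simulated Uniform(0, theta) data
      k, nu0 = 2.0, 1.0                 # hypothetical Pareto(k, nu0) prior

      b_post = len(X) + k               # posterior shape n + k
      m_post = max(nu0, X.max())        # posterior scale max{nu0, X_(n)}
      post = pareto(b_post, scale=m_post)
      print(post.mean(), post.interval(0.95))
      ```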