Conjugate Priors

Overview of Conjugate Priors

  • Prior conjugate family

    • conjugate family: prior densities whose resulting posterior densities belong to the same family
    • the choice of conjugate family is not unique
    • exponential family of densities: a sampling density w/ a sufficient statistic of constant dimension always admits a conjugate family of prior densities
    • natural conjugate family: prior densities w/ the same functional form as the likelihood
    • subjective prior or informative prior: the parameters of the prior density are elicited from previously collected data or expert knowledge
    • noninformative prior: used when no prior information is available, or very little is known about the parameter $\theta$
  • Conjugate prior

    • prior: $\theta \sim N(a, b^2)$

    • the posterior for $\theta$

      $$ \theta \,|\, \mathcal{D}_n \sim N(\overline{\theta}, \tau^2) \tag{4} $$

    • region estimate: finding a posterior interval means finding $C = (c, d)$ s.t. $\mathbb{P}(\theta \in C \,|\, \mathcal{D}_n) = 0.95$

    • $\exists\; c$ and $d$ s.t. $\mathbb{P}(\theta < c \,|\, \mathcal{D}_n) = 0.025$ and $\mathbb{P}(\theta > d \,|\, \mathcal{D}_n) = 0.025$

    • find $c$ s.t.

      $$ \mathbb{P}(\theta < c \,|\, \mathcal{D}_n) = \mathbb{P} \left( \frac{\theta - \overline{\theta}}{\tau} < \frac{c - \overline{\theta}}{\tau}\right) = \mathbb{P}\left( Z < \frac{c - \overline{\theta}}{\tau} \right) = 0.025 $$

      i.e. $c = \overline{\theta} + \tau\, \Phi^{-1}(0.025) = \overline{\theta} - 1.96\, \tau$, and symmetrically $d = \overline{\theta} + 1.96\, \tau$ (a numeric sketch follows)
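    • a minimal numeric sketch of this interval computation, assuming scipy is available (the posterior parameters below are made-up placeholders, not values derived in the text):

      ```python
      from scipy.stats import norm

      # hypothetical posterior parameters for theta | D_n ~ N(theta_bar, tau^2)
      theta_bar, tau = 2.1, 0.4

      # equal-tailed 95% posterior interval: 2.5% probability mass in each tail
      c = norm.ppf(0.025, loc=theta_bar, scale=tau)
      d = norm.ppf(0.975, loc=theta_bar, scale=tau)
      print(f"95% posterior interval: ({c:.3f}, {d:.3f})")
      ```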

  • Definition of conjugate priors

    • a family of prior distributions closed under sampling

    • $\mathcal{P}$: a family of prior distributions

    • Definition. $\forall\; \theta$, let $p(\cdot \,|\, \theta) \in \mathcal{F}$ be a sampling distribution over a sample space $\mathcal{X}$. If the posterior

      $$ p(\theta \,|\, \mathbf{x}) = \frac{p(\mathbf{x} \,|\, \theta)\, \pi(\theta)}{\int p(\mathbf{x} \,|\, \theta)\, \pi(\theta)\, d\theta} $$

      satisfies $p(\cdot \,|\, \mathbf{x}) \in \mathcal{P}$ for every prior $\pi \in \mathcal{P}$, then the family $\mathcal{P}$ is conjugate to the family of sampling distributions $\mathcal{F}$

    • the family $\mathcal{P}$ should be sufficiently restricted, and is typically taken to be a specific parametric family.

Conjugate Priors with Exponential Family

  • General exponential family models

    • $p(\cdot \,|\, \theta)$: a standard exponential family model

    • the density w.r.t. a positive measure $\mu$

      $$ p(\mathbf{x} \,|\, \pmb{\theta}) = \exp\left(\pmb{\theta}^T \mathbf{x} - A(\pmb{\theta})\right) \tag{5} $$

      • $A(\pmb{\theta})$: the moment generating or log-normalizing function

        $$ A(\pmb{\theta}) = \log\left(\int \exp(\pmb{\theta}^T \mathbf{x})\, d\mu(\mathbf{x}) \right) $$

    • the density of a conjugate prior for the exponential family

      $$ \pi_{\mathbf{x}_0,n_0}(\pmb{\theta}) = \frac{\exp(n_0 \mathbf{x}_0^T \pmb{\theta} - n_0 A(\pmb{\theta}))}{\int \exp(n_0 \mathbf{x}_0^T \pmb{\theta} - n_0 A(\pmb{\theta}))\, d\pmb{\theta}} $$

    • the posterior

      $$\begin{align*} p(\mathbf{x} \,|\, \pmb{\theta})\, \pi_{\mathbf{x}_0,n_0}(\pmb{\theta}) &= \exp(\pmb{\theta}^T \mathbf{x} - A(\pmb{\theta})) \exp\left(n_0\mathbf{x}_0^T\pmb{\theta} - n_0\, A(\pmb{\theta})\right) \\ &\propto \pi_{\frac{\mathbf{x}}{1+n_0}+\frac{n_0\mathbf{x}_0}{1+n_0},\, 1+n_0}(\pmb{\theta}) \end{align*}$$

    • the prior incorporating $n_0$ "virtual" observations of $\mathbf{x}_0 \in \mathbb{R}^d$

    • after making one "real" observation $\mathbf{x}$: the parameters of the posterior are a mixture of the virtual and the actual observations

      $$ n_0^\prime = 1 + n_0 \quad \text{ and } \quad \mathbf{x}_0^\prime = \frac{\mathbf{x}}{1 + n_0} + \frac{n_0 \mathbf{x}_0}{1 + n_0} $$
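    • a minimal sketch of this one-observation update rule in plain Python (the numbers are illustrative only):

      ```python
      def conjugate_update(x0, n0, x):
          """Hyperparameter update (n0, x0) -> (n0', x0') after one real
          observation x: the posterior mixes the n0 virtual observations
          at x0 with the actual observation."""
          x0_new = x / (1 + n0) + n0 * x0 / (1 + n0)
          return x0_new, 1 + n0

      # e.g. n0 = 4 virtual observations at x0 = 0.5, then one real x = 1.0
      print(conjugate_update(x0=0.5, n0=4, x=1.0))  # -> (0.6, 5)
      ```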

  • Generalized exponential family model

    • $n$ observations $\mathbf{X}_1, \dots, \mathbf{X}_n \implies$ the posterior

      $$ p(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) \propto \exp \left( (n + n_0) \left( \frac{n\overline{\mathbf{X}}}{n + n_0} + \frac{n_0\mathbf{x}_0}{n+n_0} \right)^T \pmb{\theta} - (n + n_0)A(\pmb{\theta}) \right) $$

    • the parameters of the posterior

      $$ n_0^\prime = n + n_0 \quad \text{ and } \quad \mathbf{x}_0^\prime = \frac{n \overline{\mathbf{X}}}{n+n_0} + \frac{n_0\mathbf{x}_0}{n+n_0} $$

    • the expectation w.r.t. $\pi_{\mathbf{x}_0, n_0}$

      $$ \mathbb{E}[\nabla A(\pmb{\theta})] = \int \nabla A(\pmb{\theta})\, \pi_{\mathbf{x}_0, n_0}(\pmb{\theta})\, d\pmb{\theta} = \mathbf{x}_0 - \frac{1}{n_0} \int \nabla \pi_{\mathbf{x}_0, n_0}(\pmb{\theta})\, d\pmb{\theta} = \mathbf{x}_0 $$

    • more generally,

      $$ \mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n] = \frac{n\overline{\mathbf{X}}}{n_0+n} + \frac{n_0 \mathbf{x}_0}{n_0+n} $$

    • under appropriate regularity conditions, the converse also holds, so that linearity of $\mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n]$ in $\overline{\mathbf{X}}$ is sufficient for conjugacy

  • Theorem (exponential family)

    • an open parameter space $\Theta \subset \mathbb{R}^d$

    • $\mathbf{X}$: a sample of size one from the exponential family $p(\cdot \,|\, \pmb{\theta})$

    • the support of $\mu$ containing an open interval

    • $\pi(\pmb{\theta})$: a prior density not concentrated at a single point

    • the posterior mean of $\nabla A(\pmb{\theta})$ given a single observation $\mathbf{X}$ is linear iff the prior is conjugate:

      $$ \mathbb{E}[\nabla A(\pmb{\theta}) \,|\, \mathbf{X}] = a\mathbf{X} + \mathbf{b} \quad\iff\quad \pi(\pmb{\theta}) \propto \exp\left( \frac{1}{a} \mathbf{b}^T \pmb{\theta} - \frac{1-a}{a} A(\pmb{\theta}) \right) $$

    • a similar result holds w/ a discrete measure $\mu$

  • Conjugate priors for discrete exponential family distributions

    | Sample Space | Sampling Dist. | Conjugate Prior | Posterior |
    | --- | --- | --- | --- |
    | $X = \{0, 1\}$ | $\text{Bernoulli}(\theta)$ | $\text{Beta}(\alpha, \beta)$ | $\text{Beta}\left(\alpha + n\overline{X}, \beta + n\left(1-\overline{X}\right)\right)$ |
    | $X = \mathbb{Z}_+$ | $\text{Poisson}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha + n\overline{X}, \beta + n\right)$ |
    | $X = \mathbb{Z}_{++}$ | $\text{Geometric}(\theta)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha+n, \beta+n\overline{X}\right)$ |
    | $X = \mathbb{H}_K$ | $\text{Multinomial}(\pmb{\theta})$ | $\text{Dirichlet}(\pmb{\alpha})$ | $\text{Dirichlet}\left(\pmb{\alpha}+n\overline{\mathbf{X}}\right)$ |
  • Conjugate priors for some continuous distributions

    | Sampling Dist. | Conjugate Prior | Posterior |
    | --- | --- | --- |
    | $\text{Uniform}(\theta)$ | $\text{Pareto}(\nu_0, k)$ | $\text{Pareto}\left(\max\{\nu_0, X_{(n)}\}, n+k\right)$ |
    | $\text{Exponential}(\theta)$ | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}\left(\alpha+n, \beta+n\overline{X}\right)$ |
    | $N(\mu, \sigma^2)$, known $\sigma^2$ | $N(\mu_0, \sigma_0^2)$ | $N\left(\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\overline{X}}{\sigma^2}\right), \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$ |
    | $N(\mu, \sigma^2)$, known $\mu$ | $\text{InvGamma}(\alpha, \beta)$ | $\text{InvGamma}\left(\alpha+\frac{n}{2}, \beta + \frac{n}{2}\, \overline{(X - \mu)^2}\right)$ |
    | $N(\mu, \sigma^2)$, known $\mu$ | $\text{ScaledInv-}\chi^2(\nu_0, \sigma_0^2)$ | $\text{ScaledInv-}\chi^2\left(\nu_0+n, \frac{\nu_0\sigma_0^2}{\nu_0 + n} + \frac{n\,\overline{(X-\mu)^{2}}}{\nu_0 + n}\right)$ |
    | $N(\pmb{\mu}, \pmb{\Sigma})$, known $\pmb{\Sigma}$ | $N(\pmb{\mu}_0, \pmb{\Sigma}_0)$ | $N\left(\mathbf{K}\left(\pmb{\Sigma}_0^{-1} \pmb{\mu}_0 + n \pmb{\Sigma}^{-1} \overline{\mathbf{X}}\right), \mathbf{K}\right)$ with $\mathbf{K} = (\pmb{\Sigma}_0^{-1} + n\pmb{\Sigma}^{-1})^{-1}$ |
    | $N(\pmb{\mu}, \pmb{\Sigma})$, known $\pmb{\mu}$ | $\text{InvWishart}(\nu_0, \mathbf{S}_0)$ | $\text{InvWishart}(\nu_0+n, \mathbf{S}_0+n \overline{\mathbf{S}})$, $\overline{\mathbf{S}}$ the sample covariance |
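  • the closed forms above are easy to sanity-check numerically; a sketch for the known-$\sigma^2$ Gaussian row, comparing the closed-form posterior against a brute-force grid normalization of prior $\times$ likelihood (numpy/scipy assumed, simulated placeholder data):

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sigma, mu0, sigma0 = 2.0, 0.0, 3.0   # known sigma; N(mu0, sigma0^2) prior
    X = rng.normal(1.5, sigma, size=50)  # simulated data

    # closed-form posterior mean and variance from the table row
    prec = 1 / sigma0**2 + len(X) / sigma**2
    mean_cf = (mu0 / sigma0**2 + len(X) * X.mean() / sigma**2) / prec
    var_cf = 1 / prec

    # brute force: normalize prior * likelihood over a fine grid of mu values
    grid = np.linspace(-5, 5, 20001)
    logp = norm.logpdf(grid, mu0, sigma0) + norm.logpdf(X[:, None], grid, sigma).sum(axis=0)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    print(mean_cf, (grid * w).sum())                  # means agree closely
    print(var_cf, ((grid - mean_cf) ** 2 * w).sum())  # variances agree closely
    ```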

Uniform-Bernoulli likelihood model

  • Uniform-Bernoulli likelihood model

    • sampling distribution: $\exists\; \mathcal{D}_n = \{X_1, \dots, X_n\}$, $\;X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$

    • prior distribution: the uniform distribution, $\pi(\theta) = 1$ for $\theta \in [0, 1]$

    • $S_n = \sum_{i=1}^n X_i$: the number of successes

    • the posterior distribution

      $$\begin{align*} p(\theta \,|\, \mathcal{D}_n) & \propto \pi(\theta)\, \mathcal{L}_n(\theta) = \theta^{S_n} (1-\theta)^{n - S_n} = \theta^{S_n+1-1} (1-\theta)^{n - S_n +1 -1} \\ &= \frac{\Gamma(n+2)}{\Gamma(S_n + 1)\, \Gamma(n-S_n+1)}\, \theta^{(S_n+1)-1} (1-\theta)^{(n-S_n+1)-1} \\ \therefore\; \theta \,|\, \mathcal{D}_n &\sim \text{Beta}(S_n+1,\, n-S_n +1) \end{align*}$$

    • the Bayesian posterior point estimator

      $$ \overline{\theta} = \frac{S_n + 1}{n+2} = \lambda_n \hat{\theta} + (1 - \lambda_n) \tilde{\theta} $$

      • $\hat{\theta} = S_n / n$: the maximum likelihood estimate
      • $\tilde{\theta} = 1/2$: the prior mean
      • $\lambda_n = n/(n+2) \approx 1$ for large $n$
    • the Bayesian posterior credible interval: a 95% posterior interval $(a, b)$ satisfies $\int_a^b p(\theta \,|\, \mathcal{D}_n)\, d\theta = 0.95$
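    • a short sketch of the point estimate and interval, assuming scipy ($n$ and $S_n$ below are made-up):

      ```python
      from scipy.stats import beta

      n, S = 20, 14                            # hypothetical trials and successes

      # flat prior => theta | D_n ~ Beta(S_n + 1, n - S_n + 1)
      a, b = S + 1, n - S + 1

      theta_bar = a / (a + b)                  # posterior mean (S_n + 1) / (n + 2)
      lo, hi = beta.ppf([0.025, 0.975], a, b)  # equal-tailed 95% posterior interval
      print(theta_bar, (lo, hi))
      ```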

  • Flat priors not invariant

    • contradiction
      • the notion of a flat prior is not well defined
      • a flat prior on a parameter $\nRightarrow$ a flat prior on a transformed version of that parameter
    • flat priors are not transformation invariant (see the example below)
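    • example: take the flat prior $\pi_\theta(\theta) = 1$ on $(0, 1)$ and the log-odds transform $\varphi = \log(\theta/(1-\theta))$; the change-of-variables formula gives

      $$ \pi_\varphi(\varphi) = \pi_\theta\big(\theta(\varphi)\big) \left|\frac{d\theta}{d\varphi}\right| = \frac{e^\varphi}{(1+e^\varphi)^2} $$

      which is not flat in $\varphi$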

Beta-Bernoulli likelihood model

  • Beta-Bernoulli likelihood model
    • sampling distribution: $\exists\; \mathcal{D}_n = \{X_1, \dots, X_n\}$, $\;X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$ w/ $\hat{\theta} = S_n/n$

    • the prior distribution: $\theta \sim \text{Beta}(\alpha, \beta)$ w/ prior mean $\theta_0 = \alpha/(\alpha+\beta)$

    • the posterior distribution: $\theta \,|\, \mathcal{D}_n \sim \text{Beta}(\alpha + S_n, \beta + n - S_n)$

    • the flat (uniform) prior: $\alpha = \beta = 1$

    • the posterior mean:

      $$ \overline{\theta} = \frac{\alpha + S_n}{\alpha + \beta + n} = \left(\frac{n}{\alpha+\beta+n}\right) \hat{\theta} + \left(\frac{\alpha+\beta}{\alpha+\beta+n} \right) \theta_0 $$
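    • a brief plain-Python sketch of this shrinkage decomposition ($\alpha$, $\beta$, and the data are placeholders):

      ```python
      alpha, beta_, n, S = 3.0, 7.0, 20, 14  # hypothetical prior and data

      mle = S / n                            # theta_hat
      prior_mean = alpha / (alpha + beta_)   # theta_0
      w = n / (alpha + beta_ + n)            # weight on the data, -> 1 as n grows

      post_mean = (alpha + S) / (alpha + beta_ + n)
      print(post_mean, w * mle + (1 - w) * prior_mean)  # identical values
      ```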

Dirichlet-Multinomial likelihood model

  • Dirichlet-Multinomial likelihood model
    • sampling distribution: $\exists\; \mathcal{D}_n = \{\mathbf{X}_1, \dots, \mathbf{X}_n\}$, $\;\mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{Multinomial}(\pmb{\theta})$

    • prior distribution: Dirichlet prior

    • the sample space of the multinomial w/ $K$ outcomes is the set of vertices of the $K$-dim hypercube $\mathbb{H}_K$, made up of vectors w/ exactly one 1 and the remaining elements 0

      $$ \mathbf{x} = \underbrace{(0, 0, \dots, 0, 1, 0, \dots, 0)^T}_{K\text{ places}} $$

    • $\exists\; \mathbf{X}_i = (X_{i1}, \dots, X_{iK})^T \in \mathbb{H}_K$,

      $$ \underbrace{\pmb{\theta} \sim \text{Dirichlet}(\pmb{\alpha})}_{\text{prior}} \;\text{ and }\; \underbrace{\mathbf{X}_i \,|\, \pmb{\theta} \sim \text{Multinomial}(\pmb{\theta})}_{\text{likelihood}} \quad \forall\; i=1, 2, \dots, n $$

      $\implies$ the posterior satisfies

      $$ p(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) \propto \mathcal{L}_n(\pmb{\theta})\, \pi(\pmb{\theta}) \propto \prod_{i=1}^n \prod_{j=1}^K \theta_j^{X_{ij}} \prod_{j=1}^K \theta_j^{\alpha_j - 1} = \prod_{j=1}^K \theta_j^{\sum_{i=1}^n X_{ij}+\alpha_j-1} $$

    • the posterior distribution w/ $\overline{\mathbf{X}} = \sum_{i=1}^n \mathbf{X}_i / n \in \Delta_K$

      $$ \pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{Dirichlet}(\pmb{\alpha} + n \overline{\mathbf{X}}) $$

    • the posterior mean

      $$ \mathbb{E}(\pmb{\theta} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n) = \left(\frac{\alpha_1 + \sum_{i=1}^n X_{i1}}{\sum_{i=1}^K \alpha_i + n}, \dots, \frac{\alpha_K + \sum_{i=1}^n X_{iK}}{\sum_{i=1}^K \alpha_i + n} \right)^T $$

    • prior conjugate w.r.t. the model: the prior is a Dirichlet distribution $\to$ the posterior is a Dirichlet distribution
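    • a minimal numpy sketch of the Dirichlet update (the category counts are made-up):

      ```python
      import numpy as np

      counts = np.array([12, 5, 3])  # sum_i X_ij over n = 20 one-hot draws
      alpha = np.ones(3)             # flat Dirichlet prior

      alpha_post = alpha + counts    # Dirichlet(alpha + n * X_bar)
      print(alpha_post, alpha_post / alpha_post.sum())  # params and posterior mean
      ```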

Gamma-Poisson likelihood model

  • Gamma-Poisson likelihood model
    • Poisson model w/ rate $\lambda \geq 0$ in the sample space $\mathcal{X} = \mathbb{Z}_+ \text{ s.t. }$

      $$ \mathbb{P}(X = x \,|\, \lambda) = \frac{\lambda^x}{x!} e^{-\lambda} \propto \exp(x\log\lambda - \lambda) $$

    • the natural parameter: $\theta = \log\lambda$

    • the conjugate prior

      $$ \pi_{x_0, n_0}(\lambda) \propto \exp(n_0 x_0 \log \lambda - n_0\lambda) $$

    • a better parameterization of the prior as the $\text{Gamma}(\alpha, \beta)$

      $$ \pi_{\alpha, \beta}(\lambda) \propto \lambda^{\alpha-1} e^{-\beta\lambda} $$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observations from $\text{Poisson}(\lambda)$

    • the posterior

      $$ \lambda \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n\overline{X},\, \beta+n) $$

    • the prior acts as if $\beta$ virtual observations were made, with a total count of $\alpha -1$ among them
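    • a brief sketch of the Gamma-Poisson update, assuming scipy (simulated placeholder data):

      ```python
      import numpy as np
      from scipy.stats import gamma

      rng = np.random.default_rng(1)
      X = rng.poisson(3.0, size=40)  # simulated Poisson counts
      alpha, beta_ = 2.0, 1.0        # hypothetical Gamma(alpha, beta) prior

      # posterior Gamma(alpha + n * X_bar, beta + n); scipy uses scale = 1/rate
      post = gamma(alpha + X.sum(), scale=1 / (beta_ + len(X)))
      print(post.mean(), post.interval(0.95))
      ```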

Gamma-Exponential likelihood model

  • Gamma-Exponential likelihood model
    • exponential distribution w/ the sample space $\mathcal{X} = \mathbb{R}_+$ s.t.

      $$ p(x \,|\, \theta) = \theta e^{-x\theta} $$

    • the exponential model is widely used for survival times or waiting times between events

    • the conjugate prior: Gamma distribution in the most convenient parameterization

      $$ \pi_{\alpha, \beta}(\theta) \propto \theta^{\alpha - 1} e^{-\beta\theta} $$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Exponential}(\theta)$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n,\, \beta + n\overline{X}) $$

    • the prior acts as if $\alpha - 1$ virtual examples were used, w/ a total waiting time of $\beta$

Gamma-Geometric likelihood model

  • Gamma-Geometric likelihood model
    • the geometric distribution

      • the discrete analogue of the exponential model
      • sample space $\mathcal{X} = \mathbb{Z}_{++}$, the strictly positive integers
      • the density

      $$ \mathbb{P}(X = x \,|\, \theta) = (1-\theta)^{x-1} \theta $$

    • the conjugate prior: $\text{Gamma}(\alpha, \beta)$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Geometric}(\theta)$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Gamma}(\alpha + n,\, \beta + n\overline{X}) $$

InvGamma-Gaussian likelihood model

  • InvGamma-Gaussian likelihood model
    • sampling distribution: $N(\mu, \sigma^2)$ w/ $\mu$ known

    • the likelihood function

      $$\begin{align*} p(X_1, \dots, X_n \,|\, \sigma^2) &\propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \right) \\ &= (\sigma^2)^{-n/2} \exp\left( -\frac{n}{2\sigma^2}\, \overline{(X - \mu)^2} \right), \quad \text{with } \overline{(X-\mu)^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \end{align*}$$

    • the conjugate prior

      • inverse Gamma distribution: $1/\theta \sim \text{Gamma}(\alpha, \beta)$

      • the density

        $$ \pi_{\alpha, \beta}(\theta) \propto \theta^{-(\alpha+1)} e^{-\beta/\theta} $$

    • the posterior distribution of $\sigma^2$

      $$ \sigma^2 \,|\, X_1, \dots, X_n \sim \text{InvGamma}\left(\alpha + \frac{n}{2},\, \beta + \frac{n}{2}\, \overline{(X - \mu)^2}\right) $$
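    • a sketch of this update w/ scipy's invgamma, whose shape/scale parameterization matches $\pi_{\alpha, \beta}$ above ($\mu$, the prior values, and the data are placeholders):

      ```python
      import numpy as np
      from scipy.stats import invgamma

      rng = np.random.default_rng(2)
      mu = 0.0                          # known mean
      X = rng.normal(mu, 2.0, size=30)  # simulated data, true sigma^2 = 4
      alpha, beta_ = 3.0, 2.0           # hypothetical InvGamma(alpha, beta) prior

      a_post = alpha + len(X) / 2
      b_post = beta_ + 0.5 * np.sum((X - mu) ** 2)
      post = invgamma(a_post, scale=b_post)  # pdf ~ x^{-(a+1)} exp(-scale/x)
      print(post.mean(), post.interval(0.95))
      ```

      the same code covers the scaled inverse-$\chi^2$ prior of the next section, since $\text{ScaledInv-}\chi^2(\nu_0, \sigma_0^2)$ coincides with $\text{InvGamma}(\nu_0/2,\, \nu_0\sigma_0^2/2)$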

ScaledInv-$\chi^2$-Gaussian likelihood model

  • ScaledInv-$\chi^2$-Gaussian likelihood model
    • the prior: the scaled inverse $\chi^2$ distribution, i.e. the distribution of $\nu_0\sigma_0^2 / Z$ w/ $Z \sim \chi_{\nu_0}^2$

      $$ \pi_{\nu_0, \sigma_0^2}(\theta) \propto \theta^{-(1+\nu_0/2)} \exp\left( -\frac{\nu_0 \sigma^2_0}{2\theta} \right) $$

    • the posterior

      $$ \sigma^2 \,|\, X_1, \dots, X_n \sim \text{ScaledInv-}\chi^2 \left( \nu_0 + n,\, \frac{\nu_0 \sigma_0^2}{\nu_0 + n} + \frac{n\, \overline{(X - \mu)^2}}{\nu_0 + n} \right) $$

InvWishart-Gaussian likelihood model

  • InvWishart-Gaussian likelihood model
    • sampling distribution: $\exists\; \mathbf{X}_1, \dots, \mathbf{X}_n$ observed data from $N(\mathbf{0}, \pmb{\Sigma})$, w/ covariance $\pmb{\Sigma} \in \mathbb{R}^{d\times d}$ (a positive semi-definite matrix)

    • the posterior: multiply the likelihood by an inverse Wishart prior

      $$\begin{align*} &p(\mathbf{X}_1, \dots, \mathbf{X}_n \,|\, \pmb{\Sigma})\, \pi_{\nu_0, \mathbf{S}_0}(\pmb{\Sigma}) \\ & \hspace{3em}\propto |\pmb{\Sigma}|^{-n/2} \exp \left( -\frac{n}{2} \text{tr}(\overline{\mathbf{S}}\pmb{\Sigma}^{-1}) \right) |\pmb{\Sigma}|^{-(\nu_0+d+1)/2} \exp \left( -\frac{1}{2}\text{tr}(\mathbf{S}_0 \pmb{\Sigma}^{-1}) \right) \\ &\hspace{1em}= |\pmb{\Sigma}|^{-(n+\nu_0+d+1)/2} \exp \left( -\frac{1}{2} \text{tr}\left(\left(n \overline{\mathbf{S}} + \mathbf{S}_0\right) \pmb{\Sigma}^{-1}\right) \right) \end{align*}$$

      • the empirical covariance: $\overline{\mathbf{S}} = \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i^T$
    • the posterior

      $$ \pmb{\Sigma} \,|\, \mathbf{X}_1, \dots, \mathbf{X}_n \sim \text{InvWishart}(\nu_0 + n,\, \mathbf{S}_0 + n\overline{\mathbf{S}}) $$

    • similarly, the conjugate prior for the inverse covariance $\pmb{\Sigma}^{-1}$ (precision matrix) is a Wishart
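    • a sketch w/ scipy's invwishart (dimension, prior, and data are all made-up):

      ```python
      import numpy as np
      from scipy.stats import invwishart

      rng = np.random.default_rng(3)
      d, n = 3, 100
      X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # mean-zero data

      nu0, S0 = d + 2, np.eye(d)   # hypothetical InvWishart(nu0, S0) prior
      S_bar = (X.T @ X) / n        # empirical covariance for the mu = 0 model

      post = invwishart(df=nu0 + n, scale=S0 + n * S_bar)
      print(np.round(post.mean(), 2))  # posterior mean of Sigma, near the identity
      ```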

Pareto-Uniform likelihood model

  • Pareto-Uniform likelihood model
    • uniform distribution: $\text{Uniform}(0, \theta)$, $\theta \geq 0$

    • sampling distribution: $\exists\; X_1, \dots, X_n$ observed data from $\text{Uniform}(0, \theta)$

    • the prior of $\theta$: $\text{Pareto}(k, \nu_0)$

    • let $X_{(n)} = \max_{1 \leq i \leq n} \{ X_i \}$

      • $\theta < \max\{\nu_0, X_{(n)}\} \implies$ either the likelihood or the prior vanishes, so

        $$ \mathcal{L}_n(\theta)\, \pi_{k, \nu_0}(\theta) = 0 $$

      • $\theta \geq \max\{\nu_0, X_{(n)}\} \implies$ the posterior satisfies ($\theta$ must be at least $X_{(n)}$ and $\nu_0$)

        $$ \mathcal{L}_n(\theta)\, \pi_{k, \nu_0}(\theta) \propto \frac{1}{\theta^n} \frac{1}{\theta^{k+1}} $$

    • the posterior

      $$ \theta \,|\, X_1, \dots, X_n \sim \text{Pareto}\left(n + k,\, \max\{X_{(n)}, \nu_0\}\right) $$

    • $n \nearrow \;\to\;$ the decay of the posterior $\nearrow \implies$ a more peaked distribution around $X_{(n)}$

    • the parameter $k$ controls the sharpness of the decay for small $n$
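    • a final sketch of the Pareto-Uniform update, assuming scipy (whose pareto takes a shape $b$ and a scale equal to the lower endpoint; data simulated):

      ```python
      import numpy as np
      from scipy.stats import pareto

      rng = np.random.default_rng(4)
      X = rng.uniform(0, 5.0, size=25)  # simulated Uniform(0, theta) data
      k, nu0 = 2.0, 1.0                 # hypothetical Pareto(k, nu0) prior

      b_post = len(X) + k               # posterior shape n + k
      m_post = max(nu0, X.max())        # posterior scale max{nu0, X_(n)}
      post = pareto(b_post, scale=m_post)
      print(post.mean(), post.interval(0.95))
      ```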