5 Bayesian inference

From a radical, but rather useless, point of view, it can be said that, once the posterior has been obtained, the work of Bayesian statistics is done.

However, many practical problems still need our attention:

  1. Model checking and sensitivity analysis

  2. How to answer typical statistical questions?

    • evaluate general hypotheses
    • establish or discard associations
    • predict unobserved data
    • . . .

How to deal with arbitrary distributions, defined on sets that can have very high dimensions, such as:

  1. the full posterior distribution;

  2. marginal posterior distributions (univariate or multivariate);

  3. posterior predictive distributions (more about these later).

5.1 Summarizing posterior inference

Point summaries

Numerical summaries of typically relevant characteristics of posterior distributions: location, dispersion, asymmetry, correlation, etc.

Usual choices for location

1. Posterior mode(s)

\[\hat{\theta}=\arg\max_{\theta\in\Theta}h(\theta\mid\mathbf{x})=\arg\max_{\theta\in\Theta}f(\mathbf{x}\mid\theta)h(\theta)\]

Note If \(h(\theta)\) is constant then the posterior mode is the MLE of \(\theta\).

2. Posterior mean

\[\hat{\theta}=E[\theta\mid\mathbf{x}],\; E[\theta_i\mid\mathbf{x}]=\int_{\Theta}{\theta_i h(\theta\mid\mathbf{x})\,d\theta}\]

Note Posterior moments may not exist!

How to choose between these and other possible summaries?
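As a concrete illustration, both point summaries can be computed directly for a Beta posterior; the Beta(16, 6) below is an assumed example (e.g. a uniform prior updated with 15 successes in 20 Bernoulli trials), not a distribution fixed by the text.

```python
# Point summaries of an illustrative Beta(16, 6) posterior.
from scipy.stats import beta

a, b = 16, 6
post = beta(a, b)

post_mean = post.mean()               # E[theta | x] = a / (a + b)
post_mode = (a - 1) / (a + b - 2)     # argmax of the Beta density (a, b > 1)
post_median = post.median()

print(post_mean, post_mode, post_median)
```

Note that the three summaries disagree whenever the posterior is asymmetric, which is one reason the choice among them matters.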

Set summaries

A more informative way to present the posterior uncertainty is to provide set summaries of posterior distributions.

\(R\subset \Theta\) is called a credible set for \(\theta\) with credibility \(\gamma\in ]0,1[\) if \[P(\theta\in R\mid\mathbf{x})=\gamma.\]

Note

  • Credibility \(\equiv\) posterior probability

  • For a given \(\gamma\) there are usually infinitely many credible sets!

A set \(R\subset \Theta\) is said to be a highest posterior density set if \[h(\theta_0\mid\mathbf{x})\geq h(\theta_1\mid\mathbf{x}),\;\forall \theta_0\in R,\; \forall \theta_1\not\in R.\]

Note

  • An HPD set is the Bayesian counterpart of a frequentist confidence set of minimum volume (area, length).

  • The HPD set with fixed credibility \(\gamma\) is unique if \(h(\theta\mid\mathbf{x})\) is not constant in any subset of \(\Theta\).

For unimodal posterior densities with a mode in the interior of \(\Theta\):

If \(R\) is an HPD set then \(h(\theta\mid\mathbf{x})\) must be constant in its boundary.

So, \(R\subset \Theta\) is an HPD credible set for \(\theta\) with credibility \(\gamma\in]0,1[\) if \[P(\theta\in R\mid\mathbf{x})=\gamma\] and \[R=\{\theta\in\Theta : h(\theta\mid\mathbf{x})\geq c_{\gamma} \}.\]
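For a unimodal posterior, the characterization above implies the HPD interval is the shortest interval with mass \(\gamma\). A minimal numerical sketch, using an assumed Beta(16, 6) posterior for illustration:

```python
# HPD credible interval for a unimodal posterior: among all intervals
# with posterior mass gamma, the HPD interval is the shortest one.
from scipy.stats import beta
from scipy.optimize import minimize_scalar

def hpd_interval(dist, gamma):
    # The lower tail mass p fixes the interval [ppf(p), ppf(p + gamma)];
    # minimize its width over p.
    def width(p):
        return dist.ppf(p + gamma) - dist.ppf(p)
    res = minimize_scalar(width, bounds=(0, 1 - gamma), method="bounded")
    return dist.ppf(res.x), dist.ppf(res.x + gamma)

post = beta(16, 6)                     # illustrative posterior
lo, hi = hpd_interval(post, 0.95)
print(lo, hi)
```

At the optimum the density takes (approximately) the same value \(c_\gamma\) at both endpoints, as the definition requires.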

Let \((X_1,\ldots,X_n)\) be a random sample from the model \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\;\sigma^2\;\mathrm{known}\}.\]

  1. Show that the product of two Gaussian densities is proportional to another Gaussian density.

    Use the following identity:

    \[a_1(x-b_1)^2+a_2(x-b_2)^2=(a_1+a_2)(x-b)^2+\frac{a_1a_2}{a_1+a_2}(b_1-b_2)^2,\]

    where \(b=\frac{a_1b_1+a_2b_2}{a_1+a_2}.\)

  2. Show that \(f(\mathbf{x}\mid\mu)\propto N(\bar{x},\sigma^2/n)\) (as a function of \(\mu\)).

  1. Find the posterior distribution for \(\mu\) given \(\mu\sim N(a,b^2)\).

  2. Identify a WIP distribution for \(\mu\) that can be obtained from the conjugate family and use it to derive the expression of the HPDCI for \(\mu\) with an arbitrary credibility \(\gamma\).
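The conjugate update asked for in the exercise can be checked numerically: with a \(N(a,b^2)\) prior and known \(\sigma^2\), the posterior is Gaussian with precision \(1/b^2+n/\sigma^2\). A sketch with illustrative numbers (all values below are assumptions, not part of the exercise):

```python
# Numerical check of the Gaussian conjugate update against a
# brute-force grid normalization of prior x likelihood.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, b, sigma, n = 1.0, 2.0, 1.5, 30          # illustrative values
x = rng.normal(0.7, sigma, size=n)
xbar = x.mean()

# closed-form conjugate update
prec = 1 / b**2 + n / sigma**2
m = (a / b**2 + n * xbar / sigma**2) / prec

# brute force: normalize prior x likelihood on a fine grid
mu = np.linspace(m - 5, m + 5, 20001)
dmu = mu[1] - mu[0]
unnorm = norm.pdf(mu, a, b) * np.exp(-n * (mu - xbar)**2 / (2 * sigma**2))
post = unnorm / (unnorm.sum() * dmu)
grid_mean = (mu * post).sum() * dmu

print(m, grid_mean)                          # the two means agree
```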

Testing hypotheses

The evaluation of \(H_0:\theta\in\Theta_0\) against \(H_1:\theta\in\Theta_1\) can usually be done by comparing posterior probabilities.

The ratio \(O(H_0,H_1\mid \mathbf{x})=\dfrac{P(H_0\mid\mathbf{x})}{P(H_1\mid\mathbf{x})}\) is called the posterior odds for \(H_0\).

The Bayes factor is also used, in order to analyse how the observed data may change the prior opinion.

The odds ratio \(B(H_0,H_1)=\dfrac{O(H_0,H_1\mid \mathbf{x})}{O(H_0,H_1)}\) is called the Bayes factor for \(H_0\).

Note

\(\log B(H_0,H_1)=\log O(H_0,H_1\mid \mathbf{x}) -\log O(H_0,H_1)\)

Let \((X_1,\ldots,X_n)\) be a random sample from the model \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\;\sigma^2\;\mathrm{known}\}.\]

With \(h(\mu)\propto c\) we have \(\mu\mid \mathbf{x}\sim N(\bar{x},\sigma^2/n)\).

Determine the posterior odds for \(H_0:\mu\leq \mu_0\) against \(H_1:\mu> \mu_0\).

What about the Bayes factor?
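Since \(\mu\mid\mathbf{x}\sim N(\bar{x},\sigma^2/n)\), the posterior odds reduce to a Gaussian tail probability ratio. A sketch with illustrative numbers (\(\bar{x}\), \(\sigma\), \(n\), \(\mu_0\) below are assumed values):

```python
# Posterior odds for H0: mu <= mu0 under the flat prior, where
# mu | x ~ N(xbar, sigma^2 / n).
from math import sqrt
from scipy.stats import norm

xbar, sigma, n, mu0 = 1.2, 2.0, 25, 1.0   # illustrative values

p0 = norm.cdf(mu0, loc=xbar, scale=sigma / sqrt(n))  # P(H0 | x)
odds = p0 / (1 - p0)
print(p0, odds)
```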

Note

  1. The Bayes factor requires a proper prior distribution.

  2. There is no formal distinction between \(H_0\) and \(H_1\).

Multiple comparisons can easily be done!

Suppose we have \(k>2\) hypotheses, \(H_i\), \(i=1,\ldots,k\). Let \(B_{i,j}=B(H_i,H_j)\). How many Bayes factors do we need?

Note that:

  • \(B_{i,i}=1\)

  • \(B_{i,j}=B_{j,i}^{-1}\)

  • \(B_{i,j}=B_{i,l}\times B_{l,j},\;\forall i,j,l\).

If \(\theta\) is assigned a continuous distribution, the Bayes factor cannot be used to evaluate \(H_0:\theta=\theta_0\) against \(H_1:\theta\neq\theta_0\)!

  1. The problem is misspecified! If \(\theta\) is treated as continuous why should we care about \(\theta=\theta_0\)?

    The reasonable question should be:

    Is the posterior distribution of \(\theta\) concentrated around \(\theta_0\)?

  2. Wait! We really care about \(\theta=\theta_0\)!

    Then, the prior distribution should reflect the importance of \(\theta=\theta_0\).

    \[h(\theta)=\begin{cases}\pi_0,& \theta=\theta_0\\ (1-\pi_0)h_1(\theta),& \theta\neq\theta_0\end{cases}\]

The observation of 20 independent Bernoulli trials produced 15 successes. Use this information to evaluate \(H_0:\theta=0.5\) against \(H_1:\theta\neq 0.5\) considering:

  1. a uniform prior for \(\theta\);

  2. that, a priori, both hypotheses are equally likely.
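One way to check the computation numerically: under the mixture prior, \(B(H_0,H_1)=f(\mathbf{x}\mid\theta_0)/\int f(\mathbf{x}\mid\theta)h_1(\theta)\,d\theta\), and with \(\pi_0=1/2\) the posterior odds equal the Bayes factor. A sketch (the numbers come from the exercise; the uniform \(h_1\) is item 1):

```python
# Bayes factor for H0: theta = 0.5 vs H1: theta != 0.5,
# with x = 15 successes in n = 20 trials and uniform h1 under H1.
from scipy.integrate import quad
from scipy.stats import binom

n, x, theta0 = 20, 15, 0.5

m0 = binom.pmf(x, n, theta0)                      # f(x | theta0)
m1, _ = quad(lambda t: binom.pmf(x, n, t), 0, 1)  # integral of f(x|t) h1(t)
B = m0 / m1
print(B)   # with pi0 = 1/2, posterior odds = Bayes factor
```

For the uniform prior the marginal under \(H_1\) is \(1/(n+1)\) (beta-binomial), which the quadrature reproduces.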

5.2 Prediction

Frequently, a statistical model is not an end in itself but rather a way to predict unobserved data.

Suppose that \(\mathbf{y}\) is a set of unobserved data with a distribution \(g(\mathbf{y}\mid \mathbf{x},\theta)\). Then all predictions can be obtained from:

\[h(\mathbf{y}\mid \mathbf{x})=\int_{\Theta}{g(\mathbf{y}\mid \mathbf{x},\theta)h(\theta\mid\mathbf{x})\,d\theta},\] the posterior predictive distribution.

Note

Often, \(\mathbf{y}\) is a new data set from the same sampling model independent of \(\mathbf{x}\) and so \(g(\mathbf{y}\mid \mathbf{x},\theta)=f(\mathbf{y}\mid \theta)\).
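The integral defining the posterior predictive distribution has a direct simulation counterpart: draw \(\theta\) from the posterior, then \(\mathbf{y}\) from \(f(\mathbf{y}\mid\theta)\). A minimal sketch, using an assumed Beta(16, 6) posterior in a Bernoulli model:

```python
# Monte Carlo approximation of the posterior predictive distribution:
# theta ~ h(theta | x), then y | theta ~ Ber(theta).
import numpy as np

rng = np.random.default_rng(1)
theta = rng.beta(16, 6, size=100_000)     # draws from the posterior
y = rng.binomial(1, theta)                # predictive draws

print(y.mean())   # estimates P(Y = 1 | x) = E[theta | x]
```

The averaging over \(\theta\) is what distinguishes the predictive from a plug-in forecast: predictive uncertainty includes posterior uncertainty about \(\theta\).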

Let \((X_1,\ldots,X_n)\) be a random sample from the model \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\;\sigma^2\;\mathrm{known}\}.\]

With \(h(\mu)\propto c\) we have \(\mu\mid \mathbf{x}\sim N(\bar{x},\sigma^2/n)\).

Find the posterior predictive distribution of the mean of a new independent sample from the model.

Consider a conjugate analysis of the data \(Y_i\mid\theta\stackrel{iid}{\sim}Ber(\theta)\). For a new observation of the model independent from the initial data:

  1. find its posterior predictive mean;

  2. obtain its posterior predictive distribution.

5.3 Computation

Except for the simplest models, posterior distributions have high dimensions and are too complicated to allow any meaningful analytical treatment.

In particular, integration is a ubiquitous problem in any Bayesian analysis:

Marginal likelihood: \[g(\mathbf{x})=\int_{\Theta}{f(\mathbf{x}\mid\theta)h(\theta)\,d\theta}\]

Posterior moments: \[E[g(\theta)\mid\mathbf{x}]=\int_{\Theta}{g(\theta)h(\theta\mid\mathbf{x})\,d\theta}\]

Posterior probabilities: \[P(\theta\in A\mid \mathbf{x})=E[I_{A}(\theta)\mid\mathbf{x}]\]

Marginal posterior distributions: \(\theta=(\theta_1,\theta_2)\), \[h(\theta_1\mid \mathbf{x})=\int_{\Theta_2}{h(\theta\mid\mathbf{x})\,d\theta_2}\]

Posterior predictive distributions and predictive summaries: \[h(\mathbf{y}\mid \mathbf{x})=\int_{\Theta}{f(\mathbf{y}\mid \theta)h(\theta\mid\mathbf{x})\,d\theta}=E_{\theta}[f(\mathbf{y}\mid \theta)\mid \mathbf{x}]\]

Apart from notable exceptions (e.g. conjugate analyses), we must resort to approximation methods.

(A) Distributional approximations

Several different ways to approximate the posterior distribution, usually using Gaussian approximations.

\(\longrightarrow\) Laplace approximation to integrals

\[\int_a^b{e^{Mf(x)}\,dx}\approx e^{Mf(x_0)}\sqrt{\frac{2\pi}{M\left|f''(x_0)\right|}},\]

where \(x_0=\arg\max_{x\in]a,b[} f(x)\) and \(M\) is a large number.
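A classical check of the formula, with \(f(x)=\log x - x\): the maximizer is \(x_0=1\), \(f''(x_0)=-1\), and the exact integral is \(\Gamma(M+1)/M^{M+1}\), so the approximation recovers Stirling's formula (this example is an illustration, not part of the notes):

```python
# Laplace approximation of the integral of exp(M f(x)) with
# f(x) = log(x) - x; exact value is Gamma(M+1) / M^(M+1).
from math import exp, sqrt, pi, lgamma, log

M = 50
f0, fpp = -1.0, -1.0                      # f(1) and f''(1)
laplace = exp(M * f0) * sqrt(2 * pi / (M * abs(fpp)))
exact = exp(lgamma(M + 1) - (M + 1) * log(M))
ratio = laplace / exact
print(ratio)                              # approaches 1 as M grows
```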

\(\longrightarrow\) INLA – Integrated Nested Laplace Approximation (algorithm and software for Bayesian computation in a wide variety of models)

\(\longrightarrow\) Variational inference

(B) Numerical integration

  1. Deterministic methods (numerical quadrature)

  2. Stochastic methods, such as Monte Carlo integration

    \[E[g(\theta)\mid\mathbf{x}]=\int_{\Theta}{g(\theta)h(\theta\mid\mathbf{x})\,d\theta}\approx \frac{1}{k}\sum_{i=1}^k{g(\theta^i)},\]

    where \(\left\{\theta^1,\ldots,\theta^k\right\}\) is a random sample generated from \(h(\theta\mid\mathbf{x})\).
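The same sample \(\{\theta^1,\ldots,\theta^k\}\) handles both posterior moments and posterior probabilities (as expectations of indicators). A sketch, again with an assumed Beta(16, 6) posterior:

```python
# Monte Carlo integration: approximate a posterior expectation and a
# posterior probability from draws of h(theta | x).
import numpy as np

rng = np.random.default_rng(2)
theta = rng.beta(16, 6, size=200_000)   # illustrative posterior sample

e_log = np.mean(np.log(theta))          # E[g(theta) | x] with g = log
p_tail = np.mean(theta > 0.5)           # P(theta > 0.5 | x) = E[I_A | x]
print(e_log, p_tail)
```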

(C) Other types of simulation

  1. Importance sampling

  2. Rejection sampling

  3. Markov chain Monte Carlo (MCMC)
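Item 2 above can be sketched in a few lines: with a proposal density \(q\) and a constant \(c\) such that \(c\,q\) dominates the target everywhere, accepted proposals are exact draws from the target. The Beta(2, 5) target and uniform proposal below are assumed for illustration:

```python
# Rejection sampling: Uniform(0,1) proposal, envelope constant c equal
# to the target density's maximum (Beta(2,5) mode is at 1/5).
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
target = beta(2, 5)
c = target.pdf(1 / 5)                    # max of the target density

u = rng.uniform(size=100_000)            # proposals
accept = rng.uniform(size=100_000) < target.pdf(u) / c
samples = u[accept]

print(samples.mean(), accept.mean())     # mean near 2/7; rate near 1/c
```

The acceptance rate \(1/c\) is the method's weak point: it collapses quickly as the dimension of \(\Theta\) grows, which motivates MCMC.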

Markov chains

Stochastic process: \(\left\{X_t,\ t=1, 2, \ldots\right\}\) with \(x_0\) – initial state

\[p(x_1,\ldots, x_n\mid x_0)=p_1(x_1\mid x_0)\,p_2(x_2\mid x_1, x_0)\cdots p_n(x_n\mid x_{n-1},\ldots, x_0)\]

Markovian property

\(p(x_1,\ldots, x_n\mid x_0)=p_1(x_1\mid x_0)p_2(x_2\mid x_1)\ldots p_n(x_n\mid x_{n-1})\)

Under certain conditions \(X_t \stackrel{t\rightarrow +\infty}{\longrightarrow} X\), and the distribution \(p(x)\) of \(X\) is called the stationary distribution.
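A two-state chain makes the convergence concrete: iterating the transition matrix drives any initial distribution to the unique \(\pi\) with \(\pi=\pi P\). The matrix below is an arbitrary illustrative choice:

```python
# Convergence of a two-state Markov chain to its stationary
# distribution pi, which satisfies pi = pi P (here pi = [0.8, 0.2]).
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
dist = np.array([1.0, 0.0])   # start deterministically in state 0
for _ in range(200):
    dist = dist @ P           # one step of the chain

print(dist)
```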

MCMC methods

  1. Build a discrete time Markov chain, with a state space \(\Theta\) such that \(h(\theta\mid x)\) is the unique stationary distribution (and a few other properties);
  2. Generate successive values \(\theta^{(t)}\) from \(p_t\) for a sufficiently long time.
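The two steps above can be sketched with a random-walk Metropolis algorithm (a member of the MCMC family; the Beta(16, 6)-shaped target and the proposal scale are illustrative assumptions). Only the unnormalized posterior is needed:

```python
# Random-walk Metropolis: the chain's stationary distribution is the
# target h(theta | x), known only up to a normalizing constant.
import numpy as np

rng = np.random.default_rng(4)

def log_target(t):               # log posterior up to a constant
    return 15 * np.log(t) + 5 * np.log(1 - t) if 0 < t < 1 else -np.inf

theta, chain = 0.5, []
for _ in range(50_000):
    prop = theta + rng.normal(0, 0.1)          # symmetric proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        theta = prop                           # accept; otherwise keep theta
    chain.append(theta)

draws = np.array(chain[5_000:])                # discard burn-in
print(draws.mean())                            # near E[theta | x] = 16/22
```

Because the proposal is symmetric, the acceptance ratio needs only the target density; this is why MCMC sidesteps the intractable normalizing constant \(g(\mathbf{x})\).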

To conduct an MCMC analysis we need a probabilistic programming language: a programming language designed to describe probabilistic models and perform inferences.

It can be:

  1. an extension of a generic programming language, or

  2. a specialised language

Some available software

  1. BUGS (1993, Component Pascal)
  2. WinBUGS (2000, Component Pascal, GUI)
  3. JAGS (2002, C++)
  4. OpenBUGS (2004, Component Pascal, GUI)
  5. Stan (2013, C++)
  6. PyMC3 (2013, Python)
  7. TensorFlow Probability (2018, Python)
  8. . . .