4 The Bayesian choice

Back to the beginning . . .

Core features of frequentist Statistics

  1. Repeated sampling \(\implies\) frequentist probabilities

  2. Key role of the sample space

  3. Inference is not probabilistic (likelihood, confidence, significance, . . . )

4.1 There’s no theorem like Bayes’ theorem

“An Essay towards solving a Problem in the Doctrine of Chances”, Philosophical Transactions, January 1, 1763, 53, 370-418.

Bayes’ theorem Let \(\mathcal{M}=\left\{M_1, M_2,\ldots\right\}\) be a partition of a sample space and \(D\) an event with \(P(D)>0\). Then, \[\begin{align} P(M_i\mid D)&=\frac{P(D\mid M_i)P(M_i)}{P(D)}\\ &=\frac{P(D\mid M_i)P(M_i)}{\sum_j{P(D\mid M_j)P(M_j)}}. \end{align}\]
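A minimal numerical sketch of the theorem over a finite partition (all probabilities below are made-up values, used only for illustration):

```python
# Bayes' theorem on a finite partition {M_1, M_2, M_3}.
# The prior probabilities and likelihoods are illustrative values only.

priors = [0.5, 0.3, 0.2]          # P(M_i), summing to 1
likelihoods = [0.10, 0.40, 0.70]  # P(D | M_i)

# Law of total probability: P(D) = sum_j P(D | M_j) P(M_j)
p_d = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' theorem: P(M_i | D) = P(D | M_i) P(M_i) / P(D)
posteriors = [l * p / p_d for l, p in zip(likelihoods, priors)]

print(posteriors)  # [0.161..., 0.387..., 0.451...], summing to 1
```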

Bayes’ theorem under a new perspective

\[P(M_i\mid D)=\frac{P(D\mid M_i)P(M_i)}{\sum_j{P(D\mid M_j)P(M_j)}}\]

In this new context, \(D\) stands for the observed data and the \(M_i\) for competing models or hypotheses.

So, Bayes’ theorem provides a way to update the prior probabilities \(P(M_i)\) with the data information \(P(D\mid M_i)\), leading to the posterior probabilities \(P(M_i\mid D)\).

Therefore, Bayes’ theorem acts as a learning mechanism: it describes how knowledge about the \(M_i\) is revised in the light of the data \(D\).

So far, Bayes’ theorem has involved two discrete distributions on \(\mathcal{M}\): the prior \(\{P(M_i)\}\) and the posterior \(\{P(M_i\mid D)\}\).

The continuous version is

\[f(M\mid D)=\frac{f(D\mid M)f(M)}{f(D)}=\frac{f(D\mid M)f(M)}{\int_{\mathcal{M}} f(D\mid M')f(M')\,dM'}\]

What can this \(\mathcal{M}\) be in a statistical context?

  1. a family of continuous distributions (nonparametric statistics)

  2. \(\mathcal{M}=\{F(x\mid \theta):\theta\in\Theta\}\) (parametric statistics)

  3. a family of regression models (parametric statistics and model comparison)

  4. \(\ldots\)

What type of probability is required?

Under the frequentist approach to Statistics, probability is assigned only to entities that can be repeatedly observed.

In a Bayesian perspective, probability is assigned to all entities affected by uncertainty.

Under this new approach to Statistics:

Probability is the measure of all uncertainty!

Bayesian statistics relies on the subjective interpretation of probability under which probabilities express a state of knowledge or a personal belief ruled by principles of rationality and coherence.

Note

All interpretations of the idea of probability are idealized constructions that require some degree of judgement and are, in this sense, subjective.

4.2 Parametric Bayesian statistics

To the probability model adopted in frequentist statistics,

\[(\mathcal{X}, \mathcal{A}, \mathcal{F}),\;\mathcal{F}=\{F(x\mid \theta):\theta\in \Theta\},\]

we now add a second component

\[(\Theta, \mathcal{B}, \mathcal{H}),\;\mathcal{H}=\{H(\theta):\theta\in \Theta\}.\]

A distribution defined on \(\Theta\) is called the prior distribution of \(\theta\).

Applying Bayes’ theorem we get the posterior distribution of \(\theta\),

\[h(\theta\mid \mathbf{x})=\frac{f(\mathbf{x}\mid\theta)h(\theta)}{g(\mathbf{x})},\]

where \[g(\mathbf{x})=\int_{\Theta}{f(\mathbf{x}\mid\theta)h(\theta)\,d\theta}\]

is called the marginal likelihood (or prior predictive distribution) and \(f(\mathbf{x}\mid\theta)h(\theta)\) is called the unnormalized posterior.

The posterior distribution represents all current knowledge about \(\theta\) and therefore it will be the source of all statistical inferences.
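A minimal computational sketch of this update (the model, prior, and data below are illustrative choices, not part of the example that follows): the marginal likelihood \(g(\mathbf{x})\) is obtained by numerical integration and used to normalize the posterior.

```python
# Sketch: h(theta | x) = f(x | theta) h(theta) / g(x), with
# g(x) computed by numerical integration over Theta = (0, 1).
# Illustrative setup: s successes in n Bernoulli trials, Beta(2, 2) prior.
from scipy import integrate, stats

n, s = 20, 6

def unnormalized_posterior(theta):
    likelihood = theta**s * (1 - theta)**(n - s)  # f(x | theta)
    prior = stats.beta.pdf(theta, 2, 2)           # h(theta)
    return likelihood * prior

# Marginal likelihood g(x) = integral of f(x | theta) h(theta) d(theta)
g_x, _ = integrate.quad(unnormalized_posterior, 0, 1)

def posterior(theta):
    return unnormalized_posterior(theta) / g_x

# The normalized posterior integrates to 1 (up to numerical error)
print(integrate.quad(posterior, 0, 1)[0])
```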

Bayesian modelling

Bayesian models do not require the common frequentist framework of random sampling from some population.

Instead, it is usually only required that the observable variables are exchangeable.

The random variables \(X_1,X_2, \ldots\) are said to be infinitely exchangeable if the distributions of any finite subsequence \((X_1,\ldots,X_n)\) and \((X_{\pi(1)},\ldots,X_{\pi(n)})\) are the same for any permutation \(\pi\) of \(\{1,\ldots,n\}\). (For example, IID random variables are exchangeable, and draws without replacement from a finite urn are exchangeable but not independent.)


Example – Hospital quality of care

Context: evaluation of the quality of care offered by a certain hospital.

As part of the evaluation program we will examine the medical records of all patients treated in the hospital in a given period of time for a particular medical condition – Acute Myocardial Infarction (AMI).

To keep things simple we’ll ignore the patients’ condition at admission and the medical care received, and focus on one particular outcome: mortality within 30 days of hospital admission (1\(\equiv\) dead, 0\(\equiv\) alive).

What can be said about the sequence of 0’s and 1’s that will be collected?

This is not a sample from any population!

  • Random variables are not required to describe some process of repeatable random sampling;

  • Instead, random variables will serve to quantify the uncertainty about the observable quantities.

So, \[P(X_1=x_1,\ldots,X_{n}=x_{n})=?\]

An exchangeability argument

If no relevant information distinguishes patients, our uncertainty about the sequence of 0’s and 1’s should be symmetrical, in the sense that it should remain unchanged under any permutation of the order of the patients.

De Finetti’s 0-1 representation theorem

If \(X_1,X_2,\ldots\) is an infinitely exchangeable sequence of 0-1 random quantities, then there exists a random variable \(\theta\) with density function \(g(\theta)\) such that \(\frac{\sum{X_i}}{n}\stackrel{P}{\rightarrow}\theta\) and \[f(x_1,\ldots, x_n)= \int_0^1{\left(\prod_{i=1}^n{\theta^{x_i}(1-\theta)^{1-x_i}}\right)g(\theta)\,d\theta}.\]


In other words, given exchangeability, it is as if:

  1. there is a random quantity \(\theta\) interpretable as the limiting relative frequency of 1’s;

  2. conditional on \(\theta\) the \(X_i\)’s are IID \(Ber(\theta)\) random variables;

  3. \(\theta\) itself has a probability distribution with density \(g(\theta)\).
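A small simulation sketch of this representation (the mixing density \(g\) below is an arbitrary Beta density, chosen only for illustration):

```python
# De Finetti-style generation of an exchangeable 0-1 sequence:
# draw theta from a mixing density g, then X_1, ..., X_n iid Ber(theta) given theta.
import numpy as np

rng = np.random.default_rng(1)

def exchangeable_sequence(n, a=2.0, b=5.0):
    theta = rng.beta(a, b)              # theta ~ g (here Beta(2, 5), arbitrary)
    x = rng.binomial(1, theta, size=n)  # X_i | theta  iid  Ber(theta)
    return theta, x

theta, x = exchangeable_sequence(10_000)
print(theta, x.mean())  # the sample mean is close to theta, as the theorem suggests
```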

A first Bayesian hierarchical model

Under exchangeability:

\(\theta \sim h(\theta)\) – the prior distribution

\(X_i\mid\theta \stackrel{iid}{\sim}Ber(\theta)\) – the likelihood

\(\stackrel{\text{Bayes' theorem}}{\implies} \theta\mid\mathbf{x} \sim h(\theta\mid\mathbf{x})\) – the posterior distribution

Note

  1. De Finetti’s representation theorem:

    exchangeability \(\iff\) conditional independence (given \(\theta\)).

  2. There are other representation theorems and other arguments to derive Bayesian models in more general situations.

  3. Under a Bayesian approach we don’t claim that \(\theta\) is a random variable. We just treat it as such.

  4. Bayes’ theorem plays an instrumental role in updating the prior distribution with the likelihood and producing the posterior distribution.

Prior elicitation

Suppose that a pool of physicians agreed that:

  1. the 30-day AMI mortality rate in the hospital region should be about 15%, which can be expressed as \(E[\theta]=0.15\);

  2. it would be quite surprising if the mortality rate is less than 5% or more than 30%, which we’ll express as \(P(0.05<\theta<0.30)=0.95\).

Note This elicitation can be seen as a “translation” of expert opinions into probabilistic language.

How to put this prior information in the form of a density?

A simple solution: look for such a density within a flexible family of densities on ]0,1[ – the beta family of distributions.

Numerically, or by trial and error, we get to \(\theta\sim B(a, b)\) with \(a=4.5\) and \(b=25.5\).

We will call \(a\) and \(b\) the hyperparameters.
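A quick numerical check of the elicited prior (a sketch; one could also solve for \(a\) and \(b\) with an optimizer, but here we simply verify the values given above):

```python
# Check how well Beta(a = 4.5, b = 25.5) matches the elicited information:
# E[theta] = 0.15 and P(0.05 < theta < 0.30) approximately 0.95.
from scipy.stats import beta

a, b = 4.5, 25.5
prior = beta(a, b)

print(prior.mean())                       # a / (a + b) = 0.15 exactly
print(prior.cdf(0.30) - prior.cdf(0.05))  # close to 0.95
```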

The posterior distribution

Data: \(n=200\) with \(S=36\) ones and \(n-S=164\) zeros

With \(\theta\sim B(a = 4.5, b = 25.5)\), applying Bayes’ theorem we get

\[\begin{align*} h(\theta\mid x) & = \dfrac{\theta^{36}(1-\theta)^{164}\times \theta^{3.5}(1-\theta)^{24.5}}{\int_0^1{\theta^{39.5}(1-\theta)^{188.5}\,d\theta}}\\[0.5cm] & \propto \theta^{39.5}(1-\theta)^{188.5} \end{align*}\]

\(\therefore \theta\mid x \sim B(40.5, 189.5)\equiv B(a+S, b+n-S)\)

Note This is a particular type of Bayesian analysis called a conjugate analysis.
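A short sketch of the corresponding conjugate update and some posterior summaries (using the counts above; the 95% interval is an equal-tailed credible interval):

```python
# Conjugate Beta-Bernoulli update for the hospital example:
# prior Beta(4.5, 25.5), data n = 200 with S = 36 deaths.
from scipy.stats import beta

a, b = 4.5, 25.5
n, S = 200, 36

posterior = beta(a + S, b + n - S)  # Beta(40.5, 189.5)

print(posterior.mean())             # posterior mean = 40.5 / 230, about 0.176
print(posterior.interval(0.95))     # equal-tailed 95% credible interval
```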

4.3 The main characteristics of Bayesian statistics

  1. Any factor of \(f(\mathbf{x}\mid\theta)\) that does not depend on \(\theta\) is irrelevant.

    \(\implies\) compliance to the Likelihood Principle

    \(\implies\) compliance to the Sufficiency Principle

  2. Bayesian statistics has a sequential nature (see the sketch after this list).

    For \(\mathbf{X}=(\mathbf{X}_1,\mathbf{X}_2)\) with \(\mathbf{X}_1\) independent of \(\mathbf{X}_2\) given \(\theta\),

    \[h(\theta\mid \mathbf{x})=\frac{f(\mathbf{x}_2\mid\theta)h(\theta\mid \mathbf{x}_1)}{\int_{\Theta}{f(\mathbf{x}_2\mid\theta)h(\theta\mid \mathbf{x}_1)\,d\theta}}\]

  3. Bayesian statistics provides a simple and unified way to deal with nuisance parameters.

    Suppose \(\theta=(\gamma, \phi)\in\Gamma\times \Phi\) where \(\phi\) is a nuisance parameter. To analyse \(\gamma\) just use \[h(\gamma\mid \mathbf{x})=\int_{\Phi}{h(\gamma,\phi\mid \mathbf{x})\,d\phi}.\]

Particular case: suppose \(\gamma\) and \(\phi\) are independent a priori and let \(U\) be a partial ancillary statistic for \(\gamma\). Then

\[h(\gamma\mid \mathbf{x})\propto h_1(\gamma) f(\mathbf{x}\mid u, \gamma)\]

\(\implies\) compliance to the Conditionality Principle (under prior independence)

  4. In Bayesian statistics, sufficiency is a mere convenience.

    If \(T\) is a sufficient statistic then \(h(\theta\mid \mathbf{x})\) depends on \(\mathbf{x}\) only through \(t=T(\mathbf{x})\), that is, \(h(\theta\mid \mathbf{x})=h(\theta\mid t)\) where \(h(\theta\mid t)\) is obtained from \(f(t\mid\theta)\).

For \(\theta \sim h(\theta)\) and \(X_i\mid\theta \stackrel{iid}{\sim}Ber(\theta)\):

\[h(\theta\mid \mathbf{x})\propto h(\theta) \theta^{\sum{x_i}}(1-\theta)^{n-\sum{x_i}}\]

\(T=\sum{X_i}\) is a sufficient statistic for \(\theta\) with \(T\mid\theta\sim Bi(n,\theta)\)

\[h(\theta\mid t)\propto h(\theta) \binom{n}{t}\theta^{t}(1-\theta)^{n-t}\]
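A small sketch illustrating points 2 and 4 above with the Beta-Bernoulli model: updating in two batches gives the same posterior as a single update, and the posterior depends on the data only through \(t=\sum{x_i}\) (the data vector below is made up for illustration).

```python
# Sequential updating and sufficiency in the Beta-Bernoulli model:
# the posterior is Beta(a + sum(x), b + n - sum(x)), so it can be computed
# in batches and depends on x only through t = sum(x) and n.
import numpy as np

def update(a, b, x):
    """Conjugate update of a Beta(a, b) prior with 0-1 data x."""
    x = np.asarray(x)
    return a + x.sum(), b + len(x) - x.sum()

a0, b0 = 4.5, 25.5                            # prior hyperparameters
x = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

print(update(a0, b0, x))                      # all data at once
print(update(*update(a0, b0, x[:4]), x[4:]))  # two batches: same posterior

rng = np.random.default_rng(0)
print(update(a0, b0, rng.permutation(x)))     # order is irrelevant: only t matters
```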

4.4 The prior distribution

  • Among several different approaches to the prior selection problem we will consider only a couple:
    1. weakly informative priors;
    2. conjugate priors.

Weakly informative priors

Weakly informative priors play an important role:

  1. Frequently there’s no tangible prior information (a state of prior ignorance) or that information is scarce (a state of vague or diffuse prior knowledge);

  2. They can be used to produce a reference analysis that can be compared to a frequentist analysis that formally only uses the data information.

Frequently, these prior distributions are referred to as noninformative priors.

However, strictly noninformative priors do not exist: we cannot be ignorant about everything.

For \(X_i\mid\theta \stackrel{iid}{\sim}Ber(\theta)\) it can be argued that \(\theta\sim U(0,1)\) expresses prior ignorance about \(\theta\).

But, if we are interested in \(\phi=\theta^2\) or \(\gamma=\frac{\theta}{1-\theta}\), can that prior distribution represent ignorance in some sense?
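A short change-of-variables check makes the difficulty concrete: if \(\theta\sim U(0,1)\) and \(\phi=\theta^2\), then

\[F_\phi(\phi)=P(\theta^2\le\phi)=P(\theta\le\sqrt{\phi})=\sqrt{\phi} \implies f_\phi(\phi)=\frac{1}{2\sqrt{\phi}},\quad 0<\phi<1,\]

so the induced prior for \(\phi\) is far from uniform: “ignorance” about \(\theta\) apparently implies definite knowledge about \(\phi\).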

Some arguments leading to weakly informative priors

1. The Bayes-Laplace principle of insufficient reason

\[\theta\sim \text{ uniform distribution on } \Theta\]

If \(\Theta\) is bounded, this seems a reasonable choice.

Otherwise, this is not a proper distribution!

In this case, we can write \(h(\theta)\propto c\) in \(\Theta\) without specifying the constant \(c\).

This is called an improper distribution that leads to:

\[h(\theta\mid\mathbf{x})\propto f(\mathbf{x}\mid\theta)\]

Improper distributions can be:

  1. justified in the context of measure theory (like the Dirac delta function, for example);

  2. interpreted as limit-cases of proper distributions;

  3. used in a Bayesian analysis with care – the posterior distribution must be a proper distribution.
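As an illustration of point 3 (a standard worked case, added here for concreteness): with Bernoulli data and the improper prior \(h(\theta)\propto \theta^{-1}(1-\theta)^{-1}\),

\[h(\theta\mid\mathbf{x})\propto \theta^{S-1}(1-\theta)^{n-S-1},\quad S=\sum{x_i},\]

which is a proper \(B(S, n-S)\) distribution only when \(1\le S\le n-1\); if all observations are 0’s or all are 1’s the posterior is improper.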

Note The use of uniform prior distributions relates to Fisher’s idea of “fiducial inference”, an attempt to perform “inverse probability” without the use of prior distributions.

2. Jeffreys’ invariance principle

  1. \(\theta\) is a location parameter \((\Theta=\mathbb{R})\)

    Invariance to location transformations \(\implies h(\theta)\propto c\)

  2. \(\theta\) is a scale parameter \((\Theta=\mathbb{R}^+)\)

    Invariance to scale transformations \(\implies h(\theta)\propto \theta^{-1}\)

    Note: \(\eta=\log\theta\) is a location parameter for the log-transformed data and so the previous case applies.

  3. \(\theta\) is a generic scalar parameter

    Invariance to one-to-one transformations \(\implies h(\theta)\propto [I(\theta)]^{1/2}\), where \(I(\theta)\) is the Fisher information (a worked example follows this list)

  4. \(\theta\) is a generic vector parameter

    1. \(h(\theta)\propto \left|\mathbf{I}(\theta)\right|^{1/2}\);

    2. under prior independence, use the previous rules for each scalar component of \(\theta\).
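As a worked instance of rule 3 (a standard result, added here for illustration): for a single Bernoulli observation the Fisher information is \(I(\theta)=1/[\theta(1-\theta)]\), so

\[h(\theta)\propto [I(\theta)]^{1/2}=\theta^{-1/2}(1-\theta)^{-1/2},\]

that is, the Jeffreys prior for the Bernoulli model is the proper \(B(1/2, 1/2)\) distribution.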

Problems

  • as before, improper priors can arise;

  • cases 3. and 4. are not fully Bayesian procedures (why?).

3. Current practice

  1. Any automatic selection of prior distributions is discouraged;

  2. There is a more pragmatic tendency to use reasonable proper prior distributions that don’t dominate the posterior distribution.

Conjugate priors

A family \(\mathcal{H}=\{H(\theta):\theta\in\Theta\}\) is a conjugate family for

\[\mathcal{F}=\{F(x\mid\theta):\theta\in\Theta\}\]

if and only if

\[h(\theta)\in\mathcal{H}\implies h(\theta\mid x)\in\mathcal{H}.\]

Equivalently, \(\mathcal{H}\) is a conjugate family for \(\mathcal{F}\) if and only if:

  1. for every \(\mathbf{x}\in\mathcal{X}\) the likelihood \(f(\mathbf{x}\mid\theta)\), viewed as a function of \(\theta\), is proportional to some \(h\in\mathcal{H}\);

  2. \(\mathcal{H}\) is closed under multiplication.

\(\theta \sim B(a, b)\)

\(X_i\mid\theta \stackrel{iid}{\sim}Ber(\theta)\)

\(\implies \theta\mid \mathbf{x}\sim B(a+S, b+n-S)\) where \(S=\sum{x_i}\)

Interpretation

\[(0, 0)\longrightarrow (a, b)\longrightarrow (a+S, b+n-S)\]

The prior \(B(a,b)\) can be interpreted as the posterior distribution obtained from a vague and improper prior distribution “\(B(0, 0)\)” updated by a pseudo-sample of size \(a+b\) with \(a\) successes.

Conjugate priors:

  1. can provide an easy interpretation of the way information is updated through a Bayesian analysis;

  2. can be useful to justify new weakly informative priors (usually improper);

  3. are a mathematical convenience.

Theorem If \(\mathcal{F}\) admits a sufficient statistic of fixed dimension for any sample size then a conjugate family for \(\mathcal{F}\) exists.

Remarkable cases

  1. the exponential family

  2. the non-regular family defined by \[f(x\mid \theta)=c(\theta)h(x)I_{]L(\theta),U(\theta)[}(x),\] where \(L(\theta)\) or \(U(\theta)\) (but not both) can be constant.

  1. Let \((X_1,\ldots,X_n)\) be a random sample from the model \[\mathcal{F}=\{U(0,\theta):\theta\in \mathbb{R}^+\}.\]

    1. Show that the Pareto family of distributions defined by

      \[h(\theta\mid a, b)=b a^b \theta^{-(b+1)}I_{(a,+\infty)}(\theta),\;a,b>0,\]

      is a conjugate family for the data model.

    2. Find the Jeffreys prior for \(\theta\) and use it to interpret the way information is updated in the conjugate analysis.

  2. Let \((X_1,\ldots,X_n)\) be a random sample from the model \[\mathcal{F}=\{Poi(\lambda):\lambda\in \mathbb{R}^+\}.\]

    1. Find a conjugate family for the data model.

    2. Using the vague improper prior from the conjugate family show that \(2n\lambda\mid\mathbf{x}\sim \chi_{(2t)}^2\), with \(t=\sum{x_i}>0.\)
