Mathematical Statistics

1st semester 2025/2026

1 Principles of data reduction

1.1 The frequentist approach to Statistics

The context

What can this \(\mathcal{F}\) be?

\(\mathcal{F}=\{\)continuous distributions\(\}\)

\(\ldots\)

\(\mathcal{F}=\{F(x\mid \theta):\theta\in \Theta\}\), where \(\Theta\) is the parameter space

Some usual questions

  1. Does the data contradict \(\mathcal{F}\) or \(\mathcal{F}_0\subset\mathcal{F}\)? \(\leadsto\) Hypothesis testing
  1. Once the validity of \(\mathcal{F}_{\theta}\) is assessed, can we refine our initial model choice? \(\leadsto\) Point or set estimation
  1. Can we use \(\mathcal{F}\) to predict unobserved data? \(\leadsto\) Prediction

A frequent scenario

\(\leadsto\) Parametric inference

The main features of Frequentist Statistics

  1. Formally, data is the only source of information;
  1. Statistical procedures are evaluated according to the frequentist concept of probability, which requires the possibility of an unlimited number of hypothetical repetitions of the sampling experiment (principle of repeated sampling);

  2. It is not possible to assess the post-experimental precision of statistical results.

What is the sample information about \(F\) (or \(\theta\))?

  1. The observed values \(\mathbf{x}=(x_1,\ldots, x_n)\);

  2. The sampling distribution

    \[f(\mathbf{x}\mid\theta)=\prod_{i=1}^n{f(x_i\mid\theta)}.\]

How do we use the data?

Frequentist Statistics relies heavily on the use of . . . statistics \[T(\mathbf{X}):\mathcal{X}\longrightarrow \mathbb{R}^k,\ 1\leq k \leq n\]

used as

  • point estimators;
  • a starting point to find pivotal quantities;
  • test statistics;
  • residuals . . .

Since, usually, \(k\ll n\), a statistic provides a reduction of the data and, potentially, some loss of information.

Do we always lose information by using any statistic?

1.2 The sufficiency principle

A statistic \(T=T(\mathbf{X})\) is said to be sufficient for \(\theta\) if and only if the distribution of \(\mathbf{X}\mid T=t\) does not depend on \(\theta\).

Another view on sufficiency

\[\mathcal{X} \xrightarrow[E]{f(\mathbf{x}\mid \theta)} \mathbf{x}\]

\[\mathbb{R}^k \xrightarrow[E_1]{f(\mathbf{t}\mid \theta)} t \xrightarrow[E_2]{f(\mathbf{x}\mid t,\theta)} \mathbf{x}:T(\mathbf{x})=t\]

If \(T\) is a sufficient statistic then the experiment \(E_2\) does not provide any further information about \(\theta\) and, therefore, is irrelevant for inferential purposes.
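
As a concrete illustration of this two-stage view, here is a simulation sketch (not part of the original notes; the Bernoulli model, \(n=4\) and \(\theta=0.3\) are my choices): the sample can be regenerated from \(T=\sum_i X_i\) alone, without knowing \(\theta\), and the result is distributionally indistinguishable from sampling \(\mathbf{X}\) directly.

```python
# Simulation sketch of the two-stage view of sufficiency for the Bernoulli model:
# E_1 draws t from the distribution of T = sum X_i (this step uses theta),
# E_2 then places the t ones uniformly at random among the n positions
# (this step is theta-free). The reconstructed samples match, in distribution,
# samples drawn directly from f(x | theta).
import numpy as np

rng = np.random.default_rng(5)
n, theta, reps = 4, 0.3, 50_000

direct = rng.binomial(1, theta, size=(reps, n))       # experiment E

t = rng.binomial(n, theta, size=reps)                 # stage E_1
two_stage = np.zeros((reps, n), dtype=int)
for i, ti in enumerate(t):                            # stage E_2 (no theta)
    two_stage[i, rng.choice(n, size=ti, replace=False)] = 1

def joint_freq(samples):
    """Empirical probability of each of the 2**n binary patterns."""
    codes = samples @ (2 ** np.arange(n))
    return np.bincount(codes, minlength=2 ** n) / len(codes)

print(np.round(joint_freq(direct), 3))
print(np.round(joint_freq(two_stage), 3))             # essentially identical
```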

Does a sufficient statistic always exist?

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{F(x\mid \theta):\theta\in \Theta\}.\]

  1. \((X_1,\ldots,X_n)\) is a (trivial) sufficient statistic for \(\theta\);

  2. \((X_{(1)},\ldots,X_{(n)})\) is also a sufficient statistic for \(\theta\).

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\left\{Ber(\theta):\theta\in ]0,1[\right\}.\]

Determine whether \(T_1=X_1+X_n\) and \(T_2=\sum_{i=1}^n{X_i}\) are sufficient statistics for \(\theta\).
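
A brute-force way to answer this for a small \(n\) (a sketch; \(n=4\) and the two values of \(\theta\) are arbitrary choices of mine) is to enumerate all binary samples and check whether \(P(\mathbf{X}=\mathbf{x}\mid T=t)\) changes with \(\theta\):

```python
# Brute-force check of the sufficiency definition for a Bernoulli sample.
from itertools import product

def cond_dist(T, theta, n=4):
    """P(X = x | T(X) = t) for every binary sample x, keyed by (t, x)."""
    samples = list(product([0, 1], repeat=n))
    joint = {x: theta**sum(x) * (1 - theta)**(n - sum(x)) for x in samples}
    pT = {}
    for x in samples:
        pT[T(x)] = pT.get(T(x), 0.0) + joint[x]
    return {(T(x), x): joint[x] / pT[T(x)] for x in samples}

T1 = lambda x: x[0] + x[-1]     # X_1 + X_n
T2 = lambda x: sum(x)           # sum of all X_i

for name, T in [("T1", T1), ("T2", T2)]:
    d_a, d_b = cond_dist(T, 0.3), cond_dist(T, 0.7)
    free_of_theta = all(abs(d_a[k] - d_b[k]) < 1e-12 for k in d_a)
    print(name, "conditional distribution free of theta:", free_of_theta)
# Expected output: False for T1, True for T2, so only T2 is sufficient.
```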

How to find sufficient statistics?

\[f(\mathbf{x}\mid t, \theta) = \frac{f(\mathbf{x}, t\mid \theta)}{f(t\mid\theta)}=\begin{cases} 0, & \mathbf{x}:T(\mathbf{x})\neq t\\ \frac{f(\mathbf{x}\mid \theta)}{f(t\mid\theta)}, & \mathbf{x}:T(\mathbf{x})=t\end{cases}, \forall t\]

So, \(T\) is a sufficient statistic for \(\theta\) if and only if \(\frac{f(\mathbf{x}\mid \theta)}{f(t\mid\theta)}\) does not depend on \(\theta\), \(\forall \mathbf{x}\in\mathcal{X}\).

Factorization theorem A statistic \(T(\mathbf{X})\) is sufficient for \(\theta\) if and only if \[f(\mathbf{x}\mid\theta)= g\left(T(\mathbf{x}),\theta\right)\,h(\mathbf{x}),\,\forall \mathbf{x}\in\mathcal{X},\,\forall \theta\in\Theta,\] for some non-negative functions \(g\) and \(h\).

Find a sufficient statistic for each of the following models:

  1. \(\mathcal{F}_1=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\; \sigma^2\in\mathbb{R}^+\}\)

  2. \(\mathcal{F}_2=\{U(\theta-1/2,\theta+1/2):\theta\in \mathbb{R}\}\)
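
For the first model above, the factorization theorem yields the sufficient statistic \(\left(\sum_i X_i, \sum_i X_i^2\right)\). A quick numerical sanity check (a sketch, not a proof; the two hand-picked samples are mine) is that two different samples sharing these two values have identical likelihoods at every \((\mu,\sigma^2)\):

```python
# Two different samples with the same (sum x_i, sum x_i^2) cannot be
# distinguished by the normal likelihood.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 5.0, 6.0])   # sum = 12, sum of squares = 62
y = np.array([2.0, 3.0, 7.0])   # different sample, same sum and sum of squares

def loglik(sample, mu, sigma):
    return norm.logpdf(sample, loc=mu, scale=sigma).sum()

rng = np.random.default_rng(0)
for _ in range(5):
    mu, sigma = rng.normal(0.0, 3.0), rng.uniform(0.5, 4.0)
    print(np.isclose(loglik(x, mu, sigma), loglik(y, mu, sigma)))   # always True
```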

Are all sufficient statistics equal?

\(\Pi_T=\left\{\pi_t\right\}_{t\in T(\mathcal{X})}\) where \(\pi_t=\{\mathbf{x}\in \mathcal{X}: T(\mathbf{x})=t\}\), is the partition of \(\mathcal{X}\) induced by \(T(\mathbf{X})\).

\(T(\mathbf{X})\) is equivalent to \(U(\mathbf{X})\) if and only if \(\Pi_T=\Pi_U\).

Let \(T(\mathbf{X})\) be a statistic and \(\Pi_T\) its partition of \(\mathcal{X}\). Consider another statistic \(U(\mathbf{X})\) that is some function of \(T(\mathbf{X})\).

Discuss how \(\Pi_T\) and \(\Pi_U\) can be related.

Given two statistics, \(T(\mathbf{X})\) and \(U(\mathbf{X})\), \(\Pi_T\) is nested in \(\Pi_U\) if and only if \[\forall \pi\in \Pi_T\, \exists \pi^*\in \Pi_U: \pi\subset \pi^* .\]

Let \((X_1,\ldots,X_n), \,n>2\) be a random sample from \(\mathcal{F}\) and consider the statistics:

\(T_0=(X_1,\ldots,X_n)\)

\(T_1=(X_{(1)},\ldots,X_{(n)})\)

\(T_2=(X_{(1)},\,X_{(n)})\)

\(T_3=X_{(n)}\)

If \(T(\mathbf{X})\) is sufficient for \(\theta\) what characterizes \(\Pi_T\)?

Sufficient statistic \(\equiv\) sufficient partition

Theorem If \(T(\mathbf{X})\) is sufficient for \(\theta\) then any other statistic \(U(\mathbf{X})\) such that \(T=g(U)\) is also sufficient for \(\theta\).

A statistic is called a minimal sufficient statistic if and only if it is sufficient and a function of any other sufficient statistic.

Lehmann & Scheffé’s method If there exists a statistic \(T(\mathbf{X})\) such that, \(\forall (\mathbf{x},\mathbf{y})\in(\mathcal{X}\backslash \Pi_0)^2\), where \(\Pi_0=\{\mathbf{x}\in\mathcal{X}: f(\mathbf{x}\mid\theta)=0, \forall \theta\in\Theta\}\), \[f(\mathbf{y}\mid\theta)=c(\mathbf{x},\mathbf{y})\,f(\mathbf{x}\mid\theta),\ \forall\theta\in\Theta\iff T(\mathbf{x})=T(\mathbf{y})\] for some positive function \(c\), then \(T(\mathbf{X})\) is a minimal sufficient statistic.

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\; \sigma^2\in\mathbb{R}^+\}.\] Find a minimal sufficient statistic for \((\mu, \sigma^2)\).

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{U(\theta-1/2,\theta+1/2):\theta\in \mathbb{R}\}.\] Show that the sufficient statistic found before is minimal.
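
A small numerical illustration of the Lehmann & Scheffé criterion for this model (a sketch with made-up samples): the likelihood, as a function of \(\theta\), is the indicator of \([x_{(n)}-1/2,\,x_{(1)}+1/2]\), so two samples yield proportional (here equal) likelihoods exactly when they share \((x_{(1)},x_{(n)})\).

```python
# Likelihoods of U(theta - 1/2, theta + 1/2) samples over a grid of theta values.
import numpy as np

def lik(sample, thetas):
    s = np.asarray(sample)
    return np.array([float(np.all((t - 0.5 <= s) & (s <= t + 0.5)))
                     for t in thetas])

thetas = np.linspace(0.0, 2.0, 2001)
x = [0.7, 0.9, 1.1]     # (min, max) = (0.7, 1.1)
y = [0.7, 0.8, 1.1]     # same (min, max), different sample
z = [0.6, 0.9, 1.1]     # different minimum

print(np.array_equal(lik(x, thetas), lik(y, thetas)))   # True
print(np.array_equal(lik(x, thetas), lik(z, thetas)))   # False
```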

An experiment is defined as \(E=\left(\mathbf{X},\mathcal{F}_{\theta},\theta\right)\) where \(\mathcal{F}_{\theta}=\{F(\mathbf{x}\mid\theta):\theta\in\Theta\}\).

Any inference about \(\theta\) given \(\mathbf{X}=\mathbf{x}\) obtained through the experiment \(E\) is denoted by \(Ev(E, \mathbf{x})\).

The sufficiency principle Consider an experiment \(E\) and a sufficient statistic for \(\theta\), \(T(\mathbf{X}).\) Then, \[\forall (\mathbf{x},\mathbf{y})\in\mathcal{X}^2: T(\mathbf{x})=T(\mathbf{y})\implies Ev(E, \mathbf{x})=Ev(E, \mathbf{y}).\]

Notes

  1. This principle seems quite reasonable and, therefore, appealing.

  2. However, it is very model-dependent and so it requires firm belief in the model.

  3. Common frequentist statistical procedures violate this principle. For example, model checking based on residuals, which usually are not functions of sufficient statistics.

1.3 Ancillary statistics

A statistic whose distribution does not depend on \(\theta\) is called an ancillary statistic.

  1. \(T(\mathbf{X}) = c\) is a trivial ancillary statistic in any model.

  2. Let \((X_1,\ldots,X_n)\) be a random sample from any member of the location-scale family of distributions with location parameter \(\lambda\) and scale parameter \(\delta\). Any statistic that is a function of

    \[\left(\frac{X_1-\lambda}{\delta},\ldots,\frac{X_n-\lambda}{\delta}\right)\]

    is an ancillary statistic.

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{U(\theta-1/2,\theta+1/2):\theta\in \mathbb{R}\}.\] Show that \[(R,C)=\left(X_{(n)}-X_{(1)}, \frac{X_{(1)}+X_{(n)}}{2}\right)\] is a minimal sufficient statistic but \(R\) is an ancillary statistic.
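
A simulation sketch of the ancillarity of \(R\) (the sample size, the two values of \(\theta\) and the use of a two-sample Kolmogorov–Smirnov comparison are arbitrary choices of mine):

```python
# The distribution of R = X_(n) - X_(1) does not change with theta.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
n, reps = 10, 20_000

def sample_R(theta):
    u = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
    return u.max(axis=1) - u.min(axis=1)

r_a, r_b = sample_R(theta=0.0), sample_R(theta=37.5)
print(ks_2samp(r_a, r_b).pvalue)   # typically large: no evidence the laws differ
```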

A statistic \(T(\mathbf{X})\) is called a complete statistic if and only if \[E[h(T)\mid \theta]=0,\ \forall \theta\in\Theta\implies P(h(T)=0\mid\theta)=1,\ \forall \theta\in\Theta.\]

Notes

  1. Any non-constant function of a complete statistic cannot be an ancillary statistic.
  1. \(T\) is complete \(\implies\) \(U=g(T)\) is also complete.

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{Ber(\theta):\theta\in ]0,1[\}.\] Show that \(T(\mathbf{X})=\sum_{i=1}^n{X_i}\) is a sufficient and complete statistic.
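
The completeness part reduces to showing that \(E[h(T)\mid\theta]=\sum_{t=0}^n h(t)\binom{n}{t}\theta^t(1-\theta)^{n-t}=0\) for all \(\theta\) forces \(h\equiv 0\). A numerical sketch of that argument (the choice of \(n\) and of the grid of \(\theta\) values is mine): imposing the constraint at \(n+1\) distinct values of \(\theta\) already gives a nonsingular linear system in \(h(0),\ldots,h(n)\).

```python
# Nonsingularity of the system E[h(T) | theta_i] = 0, i = 1, ..., n + 1.
import numpy as np
from math import comb

n = 6
thetas = np.linspace(0.1, 0.9, n + 1)
M = np.array([[comb(n, t) * th**t * (1 - th)**(n - t) for t in range(n + 1)]
              for th in thetas])
print(np.linalg.matrix_rank(M))     # n + 1: the only solution is h = 0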

Theorem A sufficient and complete statistic is also a minimal sufficient statistic.

  1. Let \((X_1,\ldots,X_n)\) be a random sample from

    \[\mathcal{F}=\{U(0,\theta):\theta\in \mathbb{R}^+\}.\]

    Show that \(T(\mathbf{X})=X_{(n)}\) is a sufficient and complete statistic.

  2. Show that for any model with a minimal sufficient statistic that is not complete there are no sufficient and complete statistics.

  3. Let \((X_1,\ldots,X_n)\) be a random sample from \(\mathcal{F}=\{U(\theta,2\theta):\theta\in \mathbb{R}^+\}\). Find a minimal sufficient statistic for \(\theta\). Is that statistic complete?
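
For the last model, the minimal sufficient statistic obtained via Lehmann & Scheffé is \((X_{(1)},X_{(n)})\), and it is not complete. A simulation sketch of the standard counterexample (the function \(h\) below is the usual textbook choice, not something stated in these notes): \(h=(n+2)X_{(n)}-(2n+1)X_{(1)}\) has mean zero for every \(\theta\) without being identically zero.

```python
# Mean-zero but non-degenerate function of (X_(1), X_(n)) under U(theta, 2*theta).
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 200_000

for theta in (0.5, 1.0, 3.0):
    x = rng.uniform(theta, 2 * theta, size=(reps, n))
    h = (n + 2) * x.max(axis=1) - (2 * n + 1) * x.min(axis=1)
    print(theta, round(h.mean(), 2), round(h.std(), 2))
# mean(h) ~ 0 (up to Monte Carlo error) for every theta, but std(h) > 0.
```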

Basu’s theorem A sufficient and complete statistic is independent of any ancillary statistic.

Useful for proving independence without deriving a joint distribution but . . . proving that a statistic is complete is often a difficult problem.

1.4 Exponential families of distributions

A family of distributions \(\mathcal{F}=\{F(\mathbf{x}\mid \theta):\theta\in \Theta\subset\mathbb{R}^p\}\) is a \(k\)-parametric exponential family if \[f(\mathbf{x}\mid \theta)=c(\theta)h(\mathbf{x})\exp\left\{\sum_{j=1}^k{n_j(\theta)T_j(\mathbf{x})}\right\}\] for some non-negative functions \(c\) and \(h\).

The support of \(f\) cannot depend on \(\theta\).

Canonical form

\[f(\mathbf{x}\mid \theta)=c(\theta)h(\mathbf{x})\exp\left\{\sum_{j=1}^k{n_j(\theta)T_j(\mathbf{x})}\right\}\]

\(\alpha_j= n_j(\theta),\;j=1,\ldots,k\): the natural parameters

\(A=\{\alpha\in\mathbb{R}^k : \theta(\alpha)\in\Theta\}\): the natural parameter space

Show that \(\mathcal{F}=\{Geo(\theta):\theta\in ]0,1[\}\) is a uniparametric exponential family.
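
A minimal numerical check, assuming the parameterisation \(f(x\mid\theta)=\theta(1-\theta)^x\) on \(x=0,1,2,\ldots\) (adjust \(T_1\) accordingly if the course counts trials instead of failures): the geometric pmf matches the exponential-family form with \(c(\theta)=\theta\), \(h(x)=1\), \(n_1(\theta)=\log(1-\theta)\) and \(T_1(x)=x\).

```python
# Geometric pmf versus its exponential-family decomposition.
import numpy as np

theta = 0.37
x = np.arange(0, 25)
pmf = theta * (1 - theta) ** x
expfam = theta * 1.0 * np.exp(np.log(1 - theta) * x)
print(np.allclose(pmf, expfam))   # True
```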

Some properties

  1. An exponential family is closed under random sampling.
  1. For any member of a \(k\)-parametric exponential family there is always a \(k\)-dimensional sufficient statistic regardless of the sample size \(n\).

  2. Any model with a \(k\)-dimensional sufficient statistic for any sample size \(n\) is a \(k\)-parametric exponential family if its support does not depend on the parameter.

Theorem The sufficient statistic for a \(k\)-parametric exponential family is complete if the natural parameter space contains an open set of \(\mathbb{R}^k\).

  1. Show that the minimal sufficient statistic for \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\; \sigma^2\in\mathbb{R}^+\}\] found before is also complete.

  2. Investigate the model \[\mathcal{F}=\{N(\theta,\theta^2):\theta\in \mathbb{R}\backslash \{0\}\}\] regarding completeness.
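
For the \(N(\theta,\theta^2)\) model the sufficient statistic is \(T=\left(\sum_i X_i,\sum_i X_i^2\right)\), and it is not complete. A simulation sketch of the usual counterexample (the function \(h\) is the standard textbook choice, not taken from these notes): since \(E\left[\sum X_i^2\right]=2n\theta^2\) and \(E\left[\left(\sum X_i\right)^2\right]=n(n+1)\theta^2\), the statistic \(h(T)=(n+1)\sum X_i^2-2\left(\sum X_i\right)^2\) has mean zero for every \(\theta\) but is not almost surely zero.

```python
# Mean-zero, non-degenerate function of the sufficient statistic under N(theta, theta^2).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 8, 200_000

for theta in (-2.0, 0.7, 5.0):
    x = rng.normal(theta, abs(theta), size=(reps, n))     # N(theta, theta^2)
    h = (n + 1) * (x**2).sum(axis=1) - 2 * x.sum(axis=1) ** 2
    mc_se = h.std() / np.sqrt(reps)
    print(theta, round(h.mean(), 2), "+/-", round(2 * mc_se, 2),
          " std(h):", round(h.std(), 1))
# The estimated E[h(T)] is compatible with 0 for every theta, while h itself is
# far from constant, so T cannot be complete.
```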

Back to Basu’s theorem . . .

Let \((X_1,\ldots,X_n)\) be a random sample from

\[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\; \sigma^2\in\mathbb{R}^+\}.\]

Show that \(\bar{X}\) and \(S^2\) are independent.
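
A simulation consistent with this consequence of Basu's theorem (a sketch, not a proof; the sample size, parameter values and the informal checks are my choices):

```python
# The sample mean and sample variance of a normal sample behave as independent.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
n, reps = 10, 100_000
x = rng.normal(loc=2.0, scale=3.0, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(round(float(np.corrcoef(xbar, s2)[0, 1]), 2))   # ~ 0
# Distribution of S^2 when Xbar is below vs above its median: independence
# predicts no difference, and the KS comparison is consistent with that.
med = np.median(xbar)
print(ks_2samp(s2[xbar <= med], s2[xbar > med]).pvalue)
```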

1.5 Sufficiency in restricted models

Quite often some parameters in a model are just auxiliary and there is no real inferential interest in them – the so-called nuisance parameters.

Consider \(\mathcal{F}_{\theta}\) with \(\theta=(\gamma, \phi)\in\Gamma\times\Phi=\Theta.\)

\(T=T(\mathbf{X})\) is said to be specific sufficient (ancillary) for \(\gamma\) if \(T\) is sufficient (ancillary) for \(\gamma\), \(\forall \phi\in\Phi\).

\(T\) is specific sufficient for \(\gamma\): \[f(\mathbf{x}\mid \gamma,\phi) = f(\mathbf{x}\mid t,\phi)\ f(t\mid\gamma,\phi)\]

\(T\) is specific ancillary for \(\phi\): \[f(\mathbf{x}\mid \gamma,\phi) = f(\mathbf{x}\mid t,\phi,\gamma)\ f(t\mid\gamma)\]

\(T(\mathbf{X})\) is said to be partial sufficient for \(\gamma\) if it is specific sufficient for \(\gamma\) and specific ancillary for \(\phi\).

\(T\) is partial sufficient for \(\gamma\): \[f(\mathbf{x}\mid \gamma,\phi) = f(\mathbf{x}\mid t,\phi)\ f(t\mid\gamma)\]

\(T\) partial sufficient for \(\gamma\iff T\) partial ancillary for \(\phi\)

  1. Check if \(\bar{X}\) and \(S^2\) are partial sufficient for \(\mu\) and \(\sigma^2\) in \[\mathcal{F}=\{N(\mu,\sigma^2):\mu\in \mathbb{R},\; \sigma^2\in\mathbb{R}^+\}.\]

  2. Let \((X_i,Y_i),\; i=1,\ldots,n\), be a random sample from \((X,Y)\) such that \(X\mid\phi\sim Poi(\phi)\) and \(Y\mid X,\gamma\sim Bi(x,\gamma)\). With \(T=\sum_{i=1}^n{X_i}\) and \(U=\sum_{i=1}^n{Y_i}\), show that \(T\) is partial sufficient for \(\phi\) but \(U\) is neither specific sufficient nor specific ancillary for \(\gamma\).

1.6 Sufficiency and Fisher’s information

A uniparametric model with the following properties is called a regular model:

  1. The model is identifiable, that is, \(\theta \rightarrow F_{\theta}\) is a one-to-one transformation, and \(\Theta\) is an open interval;

  2. The support of the model does not depend on \(\theta\);

  3. \(f(x\mid\theta)\) is differentiable with respect to \(\theta\) in \(\Theta\) and \(\frac{\partial f}{\partial \theta}\) is integrable in \(\mathcal{X}\);

  4. The operators \(\frac{\partial}{\partial \theta}\) and \(\int dx\) can be interchanged.

\[S(\mathbf{x}\mid\theta)=\frac{\partial \log f(\mathbf{x}\mid\theta)}{\partial \theta}\] is called the score function, with \(S(\mathbf{x}\mid\theta)=0\) in \(\mathcal{X}_0=\{\mathbf{x}:f(\mathbf{x}\mid\theta)=0,\;\forall\theta\}\).

\(S(\mathbf{x}\mid\theta)\) measures the variation of \(\log f(\mathbf{x}\mid\theta)\) in \(\Theta\) for a given \(\mathbf{x}\).

\(I_{\mathbf{X}}(\theta)=Var[S(\mathbf{X}\mid\theta)]\) is called Fisher’s information measure.

A measure of dispersion of \(S(\mathbf{X}\mid\theta)\) in \(\mathcal{X}\) is taken as a measure of sample information.

Theorem For a regular model we have \(E[S(\mathbf{X}\mid\theta)]=0\) and, therefore, \[I_{\mathbf{X}}(\theta)=E[S^2(\mathbf{X}\mid\theta)].\]

Some properties

  1. For \(\theta=g(\phi)\) with \(g\) differentiable \[I_{\mathbf{X}}(\phi)=I_{\mathbf{X}}\left(g(\phi)\right)\left(\frac{dg(\phi)}{d\phi}\right)^2.\]

  2. For a regular model with \(0<I_{\mathbf{X}}(\theta)<+\infty\) \[I_{\mathbf{X}}(\theta)=-E\left[\frac{\partial^2 \log f(\mathbf{X}\mid\theta)}{\partial \theta^2}\right].\]

  1. If \(\mathbf{X}=\left(\mathbf{X}_1,\mathbf{X}_2\right)\) with \(\mathbf{X}_1\) and \(\mathbf{X}_2\) independent then \(I_{\mathbf{X}}(\theta)=I_{\mathbf{X_1}}(\theta)+I_{\mathbf{X_2}}(\theta)\) and, consequently, for a random sample \(I_{\mathbf{X}}(\theta)=nI(\theta)\), where \(I(\theta)\) represents Fisher’s information for a single observation.

  2. For any statistic \(T=T(\mathbf{X})\) we have \(I_{T}(\theta) \leq I_{\mathbf{X}}(\theta)\) and \(I_{T}(\theta) = I_{\mathbf{X}}(\theta)\) if and only if \(T\) is sufficient.

The previous definitions and properties generalize naturally to \(\theta\in\mathbb{R}^k\), with \(k>1\) (Fisher’s information matrix).

Let \((X_1,\ldots,X_n)\) be a random sample from \[\mathcal{F}=\{Bi(k,\theta):\theta\in ]0,1[\}.\]

Determine Fisher’s information measure for \(\theta\) and for \(\phi=\frac{\theta}{1-\theta}\).
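
A numerical check for a single \(Bin(k,\theta)\) observation (a sketch; the closed forms \(I(\theta)=k/(\theta(1-\theta))\) and \(I(\phi)=k/(\phi(1+\phi)^2)\) quoted below are the standard results, and by additivity the information in the whole sample is \(n\) times these). The score is obtained by central-difference differentiation of \(\log f(x\mid\theta)\).

```python
# E[S] = 0 and I(theta) = E[S^2] for a single Bin(k, theta) observation,
# plus the reparameterisation phi = theta / (1 - theta).
import numpy as np
from scipy.stats import binom

k, theta, eps = 7, 0.3, 1e-6
xs = np.arange(0, k + 1)
pmf = binom.pmf(xs, k, theta)
score = (binom.logpmf(xs, k, theta + eps)
         - binom.logpmf(xs, k, theta - eps)) / (2 * eps)

print(np.sum(pmf * score))                 # ~ 0 : E[S] = 0 in a regular model
print(np.sum(pmf * score**2))              # ~ k / (theta * (1 - theta))
print(k / (theta * (1 - theta)))

# Reparameterisation: theta = g(phi) = phi / (1 + phi), so dtheta/dphi = 1/(1+phi)^2.
phi = theta / (1 - theta)
dtheta_dphi = 1.0 / (1.0 + phi) ** 2
print(np.sum(pmf * score**2) * dtheta_dphi**2)   # I(phi)
print(k / (phi * (1 + phi) ** 2))                # closed form: they agree
```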

1.7 The likelihood principle

Let \((X_1,\ldots,X_n)\) be a random sample from \(\mathcal{F}=\{F(x\mid \theta):\theta\in \Theta\}\). The function \[L(\theta\mid \mathbf{x})\equiv f(\mathbf{x}\mid\theta)\] is called the likelihood function.

The likelihood function is neither a pf nor a pdf for \(\theta\)!

The likelihood function is another data-reduction device that is widely used in Statistics:

The likelihood principle Consider two experiments \(E_1\) and \(E_2\) with a common parameter \(\theta\). Then, \(\forall \mathbf{x}_1\in\mathcal{X}_1,\;\forall\mathbf{x}_2\in\mathcal{X}_2\): \[L_1(\theta\mid\mathbf{x}_1)=c(\mathbf{x}_1,\mathbf{x}_2)\,L_2(\theta\mid\mathbf{x}_2),\ \forall \theta\in\Theta\implies Ev(E_1, \mathbf{x}_1)=Ev(E_2, \mathbf{x}_2).\]

Many frequentist statistical procedures violate this principle, and so it meets strong resistance.

Two experimenters, \(E_1\) and \(E_2\), wanted to test \(H_0:\theta=1/2\) against \(H_1:\theta>1/2\) in \(\mathcal{F}=\{Ber(\theta):\theta\in ]0,1[\}\). Both observed 9 successes and 3 failures, but from two different experiments: \(E_1\) fixed the number of trials at 12 and recorded the number of successes, \(X_1\), while \(E_2\) sampled until the 3rd failure and recorded the total number of trials, \(X_2\).

Note that the likelihood functions are proportional:

\(L_1(\theta\mid X_1=9)=\binom{12}{9}\theta^9(1-\theta)^3\)

\(L_2(\theta\mid X_2=12)=\binom{11}{2}(1-\theta)^3\theta^9\)

The p-values are given by:

\(p_1=P(X_1\geq 9\mid\theta=1/2)=1-F_{Bin(12,1/2)}(8)\approx 0.073\)

\(p_2=P(X_2\geq 12\mid\theta=1/2)=1-F_{NegBin(3,1/2)}(11)\approx 0.033\)

At the usual 5% level, \(E_2\) rejects \(H_0\) while \(E_1\) does not, even though the two likelihood functions are proportional: a violation of the likelihood principle.
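
Both p-values can be reproduced with standard routines (a sketch; note that scipy's nbinom with parameters \((3, 1-\theta)\) counts the number of \(\theta\)-successes observed before the 3rd failure, so "\(X_2\geq 12\) trials" is the event "at least 9 successes before the 3rd failure"):

```python
# Reproducing the two p-values of the Bernoulli example.
from scipy.stats import binom, nbinom

theta0 = 0.5

# Experiment 1: X_1 ~ Bin(12, theta), observed X_1 = 9.
p1 = binom.sf(8, 12, theta0)                # P(X_1 >= 9)  ~ 0.073

# Experiment 2: sample until the 3rd failure, observed 12 trials in total.
p2 = nbinom.sf(8, 3, 1 - theta0)            # P(X_2 >= 12) ~ 0.033
p2_check = binom.cdf(2, 11, 1 - theta0)     # same event: <= 2 failures in 11 trials

print(p1, p2, p2_check)
```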

Theorem The likelihood principle implies the sufficiency principle.

The conditionality principle Consider a set of \(k\) experiments with a common parameter \(\theta\), \(E_i=\left(\mathbf{X}_i,\{f_i(\mathbf{x}_i\mid\theta)\},\theta\right)\), \(i=1,\ldots,k\), from which one experiment \(E_J^*\) is randomly selected with probabilities \(p_j=P(J=j),\,j=1,\ldots,k\), that do not depend on \(\theta\). Then, \[Ev(E_J^*, \{j,\mathbf{x}_j\})=Ev(E_j, \mathbf{x}_j).\]

Notes

  1. In practice, this principle is well accepted.

  2. However, on theoretical grounds it raises some difficulties when used together with the sufficiency principle.

Birnbaum’s theorem The sufficiency and the conditionality principles are jointly equivalent to the likelihood principle.

Notes

  1. The exact conditions under which this theorem is valid are still today under much controversy (Evans, M. (2013) What does the proof of Birnbaum’s theorem prove?).

  2. To this day, frequentist Statistics has failed to establish itself on solid and universally accepted principles.
