\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det} \DeclareMathOperator{\logit}{\text{logit}}\)
What is a statistical model?
A probabilistic model that establishes mathematical relationships between some statistical variables
A way to embody a set of assumptions about the observation of data
Why are there so many statistical models?
\(+\) time, space, text, images, sounds, …
Our context
\(Y\) – a univariate quantitative response variable
\({\bf x}=(x_1,\ldots,x_k)'\) – any number of error-free predictors, covariates or explanatory variables
Some important issues
How strong is the relationship between each of those covariates and the response variable?
How accurately can we predict future or past unobserved values of \(Y\)?
What is a linear model?
Basic (additive) modelling idea:
\[Y = f(\mathbf{x}, \boldsymbol{\beta}) + E\]
where \(E\) is called the random error.
A linear model is characterized by:
the choice of a linear function of \(\bs{\beta}\):
\(f(\mathbf{x}, \boldsymbol{\beta})=\mathbf{x}' \boldsymbol{\beta}\) – the linear predictor
a set of assumptions regarding the random error:
\(E\sim N(0, \sigma^2)\), uncorrelated \(\forall \mathbf{x}\) (usually)
Which of the following models are linear?
\(Y=\beta_0 + \beta_1 x + \beta_2 x^2 + E\)
\(\log Y=\beta_0 + \beta_1 \log x + E\)
\(Y=\beta_0 x^{\beta_1} + E\)
\(Y=\beta_0 x^{\beta_1} E\)
Alternative formulation
\[Y\sim F(\boldsymbol{\theta})\]
where \(F\) belongs to the exponential family of distributions
and let \(g(E[Y])=\mathbf{x}' \boldsymbol{\beta}\), where \(g\) is called the link function
\[\begin{cases} F\ \text{gaussian} \\ g= \text{identity function} \end{cases} \implies \text{Linear model}\]
Other choices for bounded or discrete responses:
Bernoulli (Logistic regression)
Poisson/Negative binomial
Gamma
Beta
\(\leadsto\) Generalized linear models
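As a quick numerical illustration (a minimal sketch on synthetic data, assuming the statsmodels package is available; not part of the original slides), the same linear predictor \(\mathbf{x}'\boldsymbol{\beta}\) can drive a Gaussian response through the identity link or a Bernoulli response through the logit link:

```python
# Minimal sketch (synthetic data; assumes statsmodels is installed): the same
# linear predictor x'beta enters two models through different link functions --
# identity for a Gaussian response, logit for a Bernoulli response.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
X = sm.add_constant(x)                      # design matrix with intercept column
eta = 0.5 + 1.2 * x                         # linear predictor x'beta

# Gaussian response with identity link -> ordinary linear model
y_gauss = eta + rng.normal(0, 1, size=200)
lm = sm.GLM(y_gauss, X, family=sm.families.Gaussian()).fit()

# Bernoulli response with logit link -> logistic regression
p = 1 / (1 + np.exp(-eta))
y_bin = rng.binomial(1, p)
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

print(lm.params, logit_fit.params)          # both should be close to (0.5, 1.2)
```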
Why linear models?
Key analytical tool in the investigation of competing causes for some phenomenon
Some of the simplest models that we know the most about
Building block for more complex models (GLMs, mixed models, machine learning algorithms, …)
Data: \((y_i,x_i)\), \(i=1,\ldots,n\)
\[Y_i=\beta_0+\beta_1x_i+E_i,\, i=1,\ldots,n\]
\(E_i\) – random error associated with \(Y_i=(Y\mid x=x_i)\)
OLS estimation
\[S(\beta_0,\beta_1)=\sum_{i=1}^{n}{E_i^2}=\sum_{i=1}^{n}{(Y_i-\beta_0-\beta_1x_i)^2}\]
\[(\hat{\beta}_0,\hat{\beta}_1)_{OLS}=\arg\min\limits_{\boldsymbol{\beta}\in \mathbb{R}^2} S(\beta_0,\beta_1)\]
\[\left\lbrace\begin{array}{l} \frac{\partial S}{\partial \beta_0}=0\\ \frac{\partial S}{\partial \beta_1}=0\\ \end{array}\right. \iff \left\lbrace\begin{array}{lcl} \beta_0 n+\beta_1 \sum\limits_{i = 1}^n{x_i} &=& \sum\limits_{i = 1}^n {Y_i }\\ \beta_0 \sum\limits_{i = 1}^n{x_i}+\beta_1 \sum\limits_{i = 1}^n{x_i^2} &=& \sum\limits_{i = 1}^n {x_i Y_i}\\ \end{array}\right.\]
\[\iff \left\lbrace\begin{array}{lcl} \hat {\beta }_1 &=& \frac{\sum\limits_{i = 1}^n {x_i Y_i} - n\bar {x}\bar {Y}}{\sum\limits_{i = 1}^n {x_i^2 } - n\bar {x}^2}\\ \hat{\beta }_0 &=& \bar{Y} - \hat{\beta }_1 \bar{x}\\ \end{array}\right.\]
A unique solution exists if and only if \[\begin{vmatrix} n & \sum\limits_{i = 1}^n{x_i} \\ \sum\limits_{i = 1}^n{x_i} & \sum\limits_{i = 1}^n{x_i^2} \end{vmatrix}=n \sum_{i = 1}^n\left(x_i-\bar{x}\right)^2\neq 0,\]
that is, if the sample contains at least two different values of \(x\).
When the solution exists, it is indeed a minimum since the Hessian matrix of \(S\) is \[2\times\begin{pmatrix} n & \sum\limits_{i = 1}^n{x_i} \\ \sum\limits_{i = 1}^n{x_i} & \sum\limits_{i = 1}^n{x_i^2} \end{pmatrix}\]
and its determinant is \(4n \sum\limits_{i = 1}^n{(x_i-\bar{x})^2}>0\).
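To make the closed-form solution concrete, here is a minimal sketch on simulated data (the true values \(\beta_0=2\), \(\beta_1=0.7\) are arbitrary), cross-checked against numpy's generic least-squares solver:

```python
# Minimal sketch on simulated data: compute the closed-form OLS estimates and
# cross-check them against numpy's generic least-squares solver.
import numpy as np

rng = np.random.default_rng(123)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(0, 1.5, size=n)    # true beta0 = 2, beta1 = 0.7

beta1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0_hat = y.mean() - beta1_hat * x.mean()

X = np.column_stack([np.ones(n), x])              # rows (1, x_i)
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta0_hat, beta1_hat)                       # close to (2.0, 0.7)
print(beta_check)                                 # same values
```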
How can we trust this estimator?
Under the Gauss-Markov conditions:
\(E[E_i]=0\) \(\iff E[Y_i]=\beta_0+\beta_1x_i\)
\(Var[E_i]=\sigma^2\) \(\iff Var[Y_i]=\sigma^2\)
\(E_i\)’s uncorrelated \(\iff\) \(Y_i\)’s uncorrelated
it can be shown that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) for \(\boldsymbol{\beta}\).
Alternative
ML estimation under the Gaussian assumption
\[E_i\sim N\left(0,\sigma^2\right)\iff Y_i\sim N\left(\beta_0+\beta_1 x_i,\sigma^2\right)\]
\[\mathcal{L}\left(\beta_0,\beta_1,\sigma^2\mid \mathbf{y}\right)=\prod_{i=1}^{n}{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\left(y_i-(\beta_0+\beta_1 x_i)\right)^2}}\]
Somewhat surprisingly, the OLS and ML estimators for \(\beta_0\) and \(\beta_1\) coincide and
\[\hat{\sigma}_{ML}^2=\frac{\sum_{i=1}^{n}{\left(Y_i-(\hat{\beta}_0+\hat{\beta}_1 x_i)\right)^2}}{n}\]
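A small sketch (simulated data; scipy assumed available) confirming numerically that maximizing the Gaussian likelihood reproduces the OLS slope and intercept, while \(\hat{\sigma}^2_{ML}\) divides the SSE by \(n\):

```python
# Sketch on simulated data (scipy assumed): numerically maximizing the Gaussian
# likelihood reproduces the OLS slope/intercept, and sigma^2_ML = SSE / n.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 80
x = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.8, size=n)

def negloglik(theta):
    b0, b1, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)                # sigma parametrized on the log scale
    resid = y - (b0 + b1 * x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2

fit = minimize(negloglik, x0=np.zeros(3))
b0_ml, b1_ml, log_sigma_ml = fit.x

b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0_ols = y.mean() - b1_ols * x.mean()
sse = np.sum((y - (b0_ols + b1_ols * x))**2)

print(b0_ml, b1_ml, np.exp(2 * log_sigma_ml))     # ML estimates
print(b0_ols, b1_ols, sse / n)                    # same line, and sigma^2_ML = SSE/n
```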
Some notation
\(SXX = \sum\limits_{i=1}^n (x_i-\bar{x})^2 = \sum\limits_{i=1}^n x_i^2-n \bar{x}^2 = \sum\limits_{i=1}^n x_i(x_i-\bar{x})\)
\(\begin{split} SXY & = \sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y}) = \\ & = \sum_{i=1}^n x_i(Y_i-\bar{Y}) = \\ & = \sum_{i=1}^n Y_i(x_i-\bar{x}) = \\ & = \sum_{i=1}^n x_iY_i - n \bar{x} \bar{Y} \end{split}\)
Properties of the estimators
\[\hat{\beta}_1 = \frac{SXY}{SXX} = \frac{\sum_{i=1}^n Y_i(x_i-\bar{x})}{SXX} = \sum_{i=1}^n k_i Y_i\] with \(k_i = \dfrac{x_i-\bar{x}}{SXX}\)
\(\therefore \hat{\beta}_1\) is a linear estimator
\(E[\hat{\beta}_1] = \sum_{i=1}^n k_i E[Y_i] = \ldots = \beta_1\)
\(\therefore \hat{\beta}_1\) is an unbiased estimator of \(\beta_1\)
\(Var[\hat{\beta}_1] = \sum_{i=1}^n k_i^2 Var[Y_i] = \sigma^2\sum_{i=1}^n k_i^2 = \ldots = \dfrac{\sigma^2}{SXX}\)
\(\therefore \hat{\beta}_1 \sim N\left(\beta_1, \dfrac{\sigma^2}{SXX}\right)\)
\[\hat{\beta}_0 = \bar{Y} - \hat{\beta }_1 \bar{x} = \ldots = \sum_{i=1}^n \left(\frac{1}{n} - \bar{x}k_i \right) Y_i\]
\(\therefore \hat{\beta}_0\) is also a linear estimator and \(E[\hat{\beta}_0] = \ldots = \beta_0\)
\(\begin{split} Var[\hat{\beta}_0] & = Var[\bar{Y} - \hat{\beta }_1 \bar{x}] = \\ & = Var[\bar{Y}] + \bar{x}^2 Var[\hat{\beta }_1] - 2 \bar{x} Cov[\bar{Y}, \hat{\beta }_1] = \\ & = \frac{\sigma^2}{n} + \bar{x}^2\, \frac{\sigma^2}{SXX} = \\ & = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{SXX} \right) \end{split}\)
since \(Cov[\bar{Y}, \hat{\beta }_1] = \dfrac{\sigma^2}{n}\sum\limits_{i=1}^n k_i = 0\).
\(\therefore \hat{\beta}_0 \sim N\left(\beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{SXX} \right) \right)\)
Show that, in general, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are not independent since
\[Cov[\hat{\beta}_0, \hat{\beta}_1] = -\frac{\bar{x}\sigma^2}{SXX}\]
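These sampling properties can be checked by simulation; the sketch below (arbitrary true values, a fixed design reused across replicates) compares the empirical variances and covariance of \((\hat{\beta}_0,\hat{\beta}_1)\) with the expressions above:

```python
# Simulation sketch (arbitrary true values): empirical variances/covariance of
# (beta0_hat, beta1_hat) against the theoretical expressions above.
import numpy as np

rng = np.random.default_rng(42)
n, sigma, beta0, beta1 = 30, 1.0, 1.0, 0.5
x = rng.uniform(0, 10, size=n)                    # fixed design, reused in every replicate
sxx = np.sum((x - x.mean())**2)

estimates = []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))
estimates = np.array(estimates)

print(np.cov(estimates.T))                        # empirical 2 x 2 covariance matrix
print(sigma**2 * (1/n + x.mean()**2 / sxx),       # Var[beta0_hat]
      sigma**2 / sxx,                             # Var[beta1_hat]
      -x.mean() * sigma**2 / sxx)                 # Cov[beta0_hat, beta1_hat]
```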
Regarding the estimator of \(\sigma^2\), \[\hat{\sigma}^2_{ML}=\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{n} = \frac{SSE}{n}\]
it can be shown that \(E[SSE]=(n-2)\sigma^2 \implies E[\hat{\sigma}^2_{ML}] = \dfrac{n-2}{n}\sigma^2\)
\(\therefore \hat{\sigma}^2=\dfrac{n}{n-2}\hat{\sigma}^2_{ML}=\frac{SSE}{n-2}\) is an unbiased estimator of \(\sigma^2\).
Now, Cochran’s theorem states that:
\(\frac{SSE}{\sigma^2}= \frac{(n-2)\hat{\sigma}^2}{\sigma^2}\sim\chi_{(n-2)}^2\)
\((\hat{\beta}_0, \hat{\beta}_1)\) are jointly independent of \(\hat{\sigma}^2\)
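A quick simulation sketch (illustrative values only) of the first statement: across replicates, \(SSE/\sigma^2\) has mean close to \(n-2\) and variance close to \(2(n-2)\), as a \(\chi^2_{(n-2)}\) variable should:

```python
# Simulation sketch (illustrative values): SSE / sigma^2 across replicates has
# mean approx n - 2 and variance approx 2(n - 2), as for a chi-square_(n-2) variable.
import numpy as np

rng = np.random.default_rng(9)
n, sigma = 25, 1.5
x = rng.uniform(0, 10, size=n)
sxx = np.sum((x - x.mean())**2)

vals = []
for _ in range(10_000):
    y = 1.0 + 0.5 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()
    vals.append(np.sum((y - (b0 + b1 * x))**2) / sigma**2)

print(np.mean(vals), np.var(vals))                # approx 23 and 46 here (n = 25)
```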
All previous results lead to:
\[\frac{\hat{\beta}_1-\beta_1}{\hat{\sigma}\sqrt{\frac{1}{SXX}}} = \frac{\hat{\beta}_1-\beta_1}{se(\hat{\beta}_1)} \sim t_{(n-2)}\]
and
\[\frac{\hat{\beta}_0-\beta_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SXX}}} = \frac{\hat{\beta}_0-\beta_0}{se(\hat{\beta}_0)} \sim t_{(n-2)}\]
that allow us to carry out hypothesis tests and compute confidence intervals for \(\beta_0\) and \(\beta_1\).
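Putting the pieces together, here is a sketch (simulated data; scipy only supplies the \(t\) quantile) that computes the standard errors, the \(t\) statistic for \(H_0:\beta_1=0\) and 95% confidence intervals directly from the formulas above:

```python
# Sketch on simulated data: standard errors, the t statistic for H0: beta1 = 0
# and 95% confidence intervals, built directly from the formulas above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, size=n)

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * y) / sxx
b0 = y.mean() - b1 * x.mean()
sigma2_hat = np.sum((y - (b0 + b1 * x))**2) / (n - 2)    # unbiased estimator of sigma^2

se_b1 = np.sqrt(sigma2_hat / sxx)
se_b0 = np.sqrt(sigma2_hat * (1/n + x.mean()**2 / sxx))
t_b1 = b1 / se_b1                                        # test statistic for H0: beta1 = 0
tq = stats.t.ppf(0.975, df=n - 2)                        # 97.5% quantile of t_(n-2)

print("beta1:", b1, "t =", t_b1, "95% CI:", (b1 - tq * se_b1, b1 + tq * se_b1))
print("beta0:", b0, "95% CI:", (b0 - tq * se_b0, b0 + tq * se_b0))
```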
Inferences for the mean response
\[E[Y|x_0] = \beta_0 + \beta_1 x_0\]
\(\hat{\beta}_0 + \hat{\beta}_1 x_0\) is an unbiased estimator for \(E[Y|x_0]\)
\(Var[\hat{\beta}_0 + \hat{\beta}_1 x_0]= \sigma^2\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SXX}\right)\)
\(\hat{\beta}_0\) and \(\hat{\beta}_1\) are usually correlated but they are jointly gaussian. Therefore:
\[\frac{\hat{\beta}_0+ \hat{\beta}_1 x_0 - E[Y|x_0]}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SXX}}} \sim t_{(n-2)}\]
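A short sketch (same kind of simulated data; the value \(x_0=4\) is purely illustrative) of a 95% confidence interval for \(E[Y|x_0]\):

```python
# Sketch (simulated data; x0 = 4 is arbitrary): a 95% confidence interval for
# the mean response E[Y | x0], using the variance expression above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, size=n)

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * y) / sxx
b0 = y.mean() - b1 * x.mean()
sigma2_hat = np.sum((y - (b0 + b1 * x))**2) / (n - 2)

x0 = 4.0
fit0 = b0 + b1 * x0                               # point estimate of E[Y | x0]
se_mean = np.sqrt(sigma2_hat * (1/n + (x0 - x.mean())**2 / sxx))
tq = stats.t.ppf(0.975, df=n - 2)
print(fit0, (fit0 - tq * se_mean, fit0 + tq * se_mean))
```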
Prediction of a new observation
The goal is to predict \(Y_u\), an unobserved response value for \(x=x_u\), independent of the observed data.
To do that we need to consider two sources of uncertainty:
the uncertainty in the mean of the distribution of \(Y_u\);
the variation within that distribution.
Again, the idea is to use a standardized random variable of the form \[\frac{Y_u-E[Y_u\mid x]}{sd(pred)}\]
Since \(E[Y_u\mid x]\) is unknown, we must use its estimate \(\hat{Y}_u = \hat{\beta}_0 + \hat{\beta}_1 x_u\).
As we have seen \(E[Y_u - \hat{Y}_u] = 0\) and
\(\begin{split} Var[Y_u - \hat{Y}_u] & = Var[Y_u] + Var[\hat{Y}_u] = \\ & = \sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(x_u-\bar{x})^2}{SXX}\right) = \\ & = \sigma^2\left(1 +\frac{1}{n} + \frac{(x_u-\bar{x})^2}{SXX}\right) \end{split}\)
Finally, it can be shown that
\[\frac{Y_u-\hat{Y}_u}{se(pred)}\sim t_{(n-2)}\]
where \(se(pred)=\hat{\sigma}\sqrt{1 +\frac{1}{n} + \frac{(x_u-\bar{x})^2}{SXX}}\).
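The corresponding sketch for a 95% prediction interval at \(x_u\) (again on simulated data, with an arbitrary \(x_u=4\)); note the extra \(1+\) term, which makes it wider than the interval for the mean response:

```python
# Sketch (simulated data; x_u = 4 is arbitrary): a 95% prediction interval for
# a new observation -- the extra "1 +" term makes it wider than the mean interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, size=n)

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * y) / sxx
b0 = y.mean() - b1 * x.mean()
sigma2_hat = np.sum((y - (b0 + b1 * x))**2) / (n - 2)

x_u = 4.0
y_hat_u = b0 + b1 * x_u                           # point prediction
se_pred = np.sqrt(sigma2_hat * (1 + 1/n + (x_u - x.mean())**2 / sxx))
tq = stats.t.ppf(0.975, df=n - 2)
print((y_hat_u - tq * se_pred, y_hat_u + tq * se_pred))
```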
Note
Inferences and predictions drawn from linear models are only valid within the original data domain.
Extrapolations should be made with caution!
\[y_i = \beta_0 + \sum_{j=1}^{p-1}\beta_j x_{i,j} + e_i,\,i=1,\ldots,n\]
\[\underset{n\times 1}{\b{y}} = \underset{n\times p}{\b{X}}\quad \underset{p\times 1}{\bs{\beta}}+\underset{n\times 1}{\b{e}}\]
\[\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}=\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,p-1}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n,p-1} \end{pmatrix} \begin{pmatrix}\beta_0\\ \vdots\\ \beta_{p-1}\end{pmatrix}+ \begin{pmatrix}e_1\\ \vdots\\ e_n\end{pmatrix}\]
\[\b{y} = \b{X} \bs{\beta} + \b{e}\]
\(\b{y}\ (n\times 1)\) – vector of observations of a quantitative response variable or some transformation of it
\(\b{X}\ (n\times p)\) – matrix of model terms (values of covariates or functions of those) – the model specification matrix or design matrix
\(\bs{\beta}\ (p\times 1)\) – vector of parameters
\(\b{e}\ (n\times 1)\) – vector of random components with \(\b{e}\sim N_n(\b{0}, \sigma^2\b{I})\)
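A minimal sketch of the matrix formulation on synthetic data (two covariates, arbitrary coefficients): build \(\b{X}\) explicitly and obtain \(\hat{\bs{\beta}}\) by least squares.

```python
# Minimal sketch of the matrix formulation (synthetic data, two covariates):
# build X explicitly and obtain beta_hat by least squares.
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])         # n x p design matrix, p = 3
beta = np.array([1.0, -0.5, 2.0])                 # arbitrary true coefficients
y = X @ beta + rng.normal(0, 1, size=n)           # y = X beta + e

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves the least-squares problem
print(beta_hat)                                   # close to (1.0, -0.5, 2.0)
```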
Remarks
Multiple linear regression (1st order model)
\[y_i = \beta_0 + \sum_{j=1}^{p-1}\beta_j x_{i,j} + e_i,\,i=1,\ldots,n\]
\[\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}=\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,p-1}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n,p-1} \end{pmatrix} \begin{pmatrix}\beta_0\\ \vdots\\ \beta_{p-1}\end{pmatrix}+ \begin{pmatrix}e_1\\ \vdots\\ e_n\end{pmatrix}\]
Linear regression by transformation
\[y_i = \beta_0 x_{i}^{\beta_1} e_i,\,i=1,\ldots,n\] \[\Updownarrow\] \[\log y_i = \log\beta_0 + \beta_1 \log x_{i} + \log e_i,\,i=1,\ldots,n\]
\[\begin{pmatrix}\log y_1\\ \vdots\\ \log y_n\end{pmatrix}=\begin{pmatrix} 1 & \log x_{1} \\ \vdots & \vdots \\ 1 & \log x_{n} \end{pmatrix} \begin{pmatrix}\log \beta_0\\ \beta_{1}\end{pmatrix}+ \begin{pmatrix}\log e_1\\ \vdots\\ \log e_n\end{pmatrix}\]
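A sketch of this linearization on simulated data (positive multiplicative errors, true values \(\beta_0=2\), \(\beta_1=1.5\) chosen for illustration): fitting on the log scale recovers \(\log\beta_0\) and \(\beta_1\).

```python
# Sketch of the linearization (simulated data, positive multiplicative errors,
# true beta0 = 2 and beta1 = 1.5 chosen for illustration).
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(1, 10, size=n)
e = np.exp(rng.normal(0, 0.1, size=n))            # log e_i ~ N(0, 0.1^2)
y = 2.0 * x**1.5 * e

X = np.column_stack([np.ones(n), np.log(x)])      # regress log y on log x
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(coef[0]), coef[1])                   # close to (2.0, 1.5)
```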
Polynomial regression
\[y_i = \sum_{j=0}^{p-1}\beta_j x_{i}^j + e_i,\,i=1,\ldots,n\]
\[\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}=\begin{pmatrix} 1 & x_{1} & x_{1}^2 & \cdots & x_1^{p-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n} & x_{n}^2 & \cdots & x_n^{p-1} \end{pmatrix} \begin{pmatrix}\beta_0\\ \vdots\\ \beta_{p-1}\end{pmatrix}+ \begin{pmatrix}e_1\\ \vdots\\ e_n\end{pmatrix}\]
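A sketch showing that polynomial regression is still a linear model: the design matrix simply collects powers of \(x\) (here built with np.vander on simulated data).

```python
# Sketch: polynomial regression is still a linear model -- the design matrix
# just collects powers of x (simulated data, true coefficients (1.0, -0.5, 0.8)).
import numpy as np

rng = np.random.default_rng(11)
n = 60
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.3, size=n)

X = np.vander(x, N=3, increasing=True)            # columns: 1, x, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                   # close to (1.0, -0.5, 0.8)
```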
An incomplete rank design
Results from an experiment comparing yields \(Y\) (measured as dried weight of plants) obtained for a control group \((x_1=1)\) and a treatment group \((x_2=1)\), where \(x_1\) and \(x_2\) are binary variables.
| Group | \(y_{i,1}\) | \(y_{i,2}\) | \(y_{i,3}\) | \(y_{i,4}\) | \(y_{i,5}\) | \(y_{i,6}\) | \(y_{i,7}\) | \(y_{i,8}\) | \(y_{i,9}\) | \(y_{i,10}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Control (\(x_1=1\)) | 4.17 | 5.58 | 5.18 | 6.11 | 4.50 | 4.61 | 5.17 | 4.53 | 5.33 | 5.14 |
| Treatment (\(x_2=1\)) | 6.31 | 5.12 | 5.54 | 5.50 | 5.37 | 5.29 | 4.92 | 6.15 | 5.80 | 5.26 |
Notation: \(y_{i,j}\) is the dried weight of the \(j\)-th plant belonging to the group \(i\), \(j=1,\ldots,n_i\), \(i=1,2\).
\[y_{i,j} = \beta_{0}+ \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + e_{i,j}\] \[\Updownarrow\] \[y_{i,j} = \beta_{0}+ \beta_{i} + e_{i,j}\]
\[\begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} = \begin{pmatrix} 1&1 &0 \\ \vdots & \vdots &\vdots \\ 1&1 &0 \\ 1&0 &1 \\ \vdots &\vdots &\vdots \\ 1 &0 &1 \end{pmatrix} \begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \end{pmatrix} + \begin{pmatrix} e_{1,1} \\ \vdots \\ e_{1,n_1} \\ e_{2,1} \\ \vdots \\ e_{2,n_2} \end{pmatrix}\]
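A short numerical check of the rank deficiency (group sizes \(n_1=n_2=10\) as in the example): the three columns of \(\b{X}\) are linearly dependent, since the two indicator columns sum to the intercept column.

```python
# Numerical check of the rank deficiency (n1 = n2 = 10 as in the example): the
# two indicator columns sum to the intercept column, so X has rank 2 < p = 3.
import numpy as np

n1 = n2 = 10
X = np.column_stack([
    np.ones(n1 + n2),                             # intercept
    np.r_[np.ones(n1), np.zeros(n2)],             # x1: control indicator
    np.r_[np.zeros(n1), np.ones(n2)],             # x2: treatment indicator
])
print(X.shape, np.linalg.matrix_rank(X))          # (20, 3) but rank 2
print(np.linalg.matrix_rank(X.T @ X))             # X'X is singular as well
```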