\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)
To select terms in linear models, we have considered:

- the best subsets method
  - does not scale well to large \(p\)
- stepwise regression
  - can inflate coefficient estimates, deflate standard errors, and lead to arbitrary variable selection under multicollinearity
Penalized regression can provide estimation shrinkage and variable selection in a single step.
In general, penalized regression can be formulated as the following maximization problem
\[\hat{\bs{\beta}} = \underset{\bs{\beta}}{\arg\max}\left\{ \mathcal{L}(\bs{\beta})-\lambda P(\bs{\beta})\right\}\]
where \(P\) is a penalty function and \(\lambda \geq 0\) is a regularization or tuning parameter.
Ridge regression (\(L_2\) regularization)
\[P(\bs{\beta})=\sum_{i=1}^{p-1}\beta^2_i\]
\(\implies \hat{\beta}_i=\hat{\beta}_i^{OLS}\frac{1}{1 + n\lambda}\) (for orthonormal terms)
Developed to handle multicollinearity in LMs, it provides a uniform shrinking of the estimates.
LASSO regression (\(L_1\) regularization)
\[P(\bs{\beta})=\sum_{i=1}^{p-1}|\beta_i|\]
\(\implies \hat{\beta}_i=\hat{\beta}_i^{OLS}\max\left\{0,1-\frac{n\lambda}{\left|\hat{\beta}_i^{OLS}\right|}\right\}\) (for orthonormal terms)
Non-uniform shrinkage and term selection.
Elastic net regression
\[P(\bs{\beta})=(1-\alpha)\sum_{i=1}^{p-1}\beta^2_i + \alpha\sum_{i=1}^{p-1}|\beta_i|,\]
with \(\alpha\in [0,1]\).
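For orthonormal terms, the ridge and LASSO penalties act on each OLS estimate through the simple closed-form maps above. A minimal Python sketch (the helper names are ours, and \(t=n\lambda\) collects the tuning constants):

```python
def ridge_shrink(b_ols, t):
    # ridge: every coefficient is shrunk by the same factor 1/(1 + t)
    return b_ols / (1 + t)

def lasso_shrink(b_ols, t):
    # LASSO: soft thresholding; coefficients smaller than t are set exactly to zero
    return b_ols * max(0.0, 1 - t / abs(b_ols))

ridge_shrink(3.0, 1.0)   # 1.5: shrunk, but never exactly zero
lasso_shrink(3.0, 1.0)   # 2.0: shrunk by a constant amount
lasso_shrink(0.5, 1.0)   # 0.0: the term is dropped from the model
```

This is why ridge alone never performs variable selection, while the LASSO does.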
Notes
Requires standardized covariates (correlation transformation) \(\leadsto\) quantitative covariates only!
These ideas are not restricted to LMs.
R: glmnet package
Linear models were defined as
\[Y_i \sim \text {N}(\mu_i, \sigma^2),\ i=1,\ldots,n\]
with \(\mu_i=\b{x}'_i\bs{\beta}\).
For many observable random variables this Gaussian assumption is not reasonable.
Let us consider the observation of a binary response variable:
\[Y_i \sim \text {Ber}(\theta_i),\, \theta_i=P(Y_i=1)\in ]0,1[,\]
independent for \(i=1,\ldots,n\).
Now we cannot directly follow the previous modelling strategy with \[E[Y_i\mid \b{x}_i]=\theta_i=\b{x}_i'\bs{\beta}\]
Why? Because \(\b{x}_i'\bs{\beta}\) ranges over the whole real line, while \(\theta_i\) must lie in \(]0,1[\).
Instead, we can try something like \[g(\theta_i)=\b{x}_i'\bs{\beta}\]
where \(g: ]0,1[ \rightarrow \mathbb{R}\) is called a link function.
A possible choice is the logit function:
\[\mathop{\mathrm{\text{logit}}}(\theta)=\log \left(\frac{\theta}{1-\theta}\right)\]
The inverse is the logistic function:
\[g^{-1}(x)=\frac{1}{1 + e^{-x}}=\frac{e^x}{e^x+1}\]
with \(x\in\mathbb{R}\).
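As a quick sanity check, the link and its inverse can be coded directly; a minimal Python sketch (function names ours):

```python
import math

def logit(theta):
    # log-odds: maps ]0,1[ onto the whole real line
    return math.log(theta / (1 - theta))

def logistic(x):
    # inverse link: maps the real line back into ]0,1[
    return 1 / (1 + math.exp(-x))

# the round trip recovers the original probability
theta = 0.3
assert abs(logistic(logit(theta)) - theta) < 1e-12
```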
The logistic regression model can be written as:
\[Y_i \sim \text {Ber}\left(g^{-1}(\b{x}_i'\bs{\beta})\right),\]
independent for \(i=1,\ldots,n\).
Notes
The interpretation of the regression coefficients now relates to the logarithm of the odds of success:

- \(\beta_0\): log-odds of success for \(\b{x}=\b{0}\)
- \(\beta_i\): change in the log-odds of success due to a unit increase of \(x_i\) while the remaining terms are held constant
Other inferences rely on approximations that require large samples, such as:
\[\frac{\hat{\beta}_i-\beta_i}{se(\hat{\beta}_i)}\stackrel{a}{\sim} N(0,1)\]
where \(se(\hat{\beta}_i)\) is computed from the Hessian matrix of the log-likelihood function.
Other functions can be used as the inverse link function; for example \(\Phi(x)\), the standard Gaussian CDF, leads to the probit regression model.
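The standard Gaussian CDF has no elementary closed form, but it can be written with the error function, so a probit inverse link can be sketched in Python without any statistics library (function name ours):

```python
import math

def Phi(x):
    # standard normal CDF, expressed via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# like the logistic function, Phi maps the real line into ]0,1[
Phi(0.0)  # 0.5
```

Both inverse links are sigmoidal; they differ most in the tails, so the two models give similar fits except for probabilities near 0 and 1.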
Generalization (repeated observations)
\[Y_i \sim \text {binomial}(n_i, \theta_i),\]
independent for \(i=1,\ldots,k\).
\(E\left[\frac{Y_i}{n_i}\right] = \theta_i\) and \(\mathop{\mathrm{\text{logit}}}(\theta_i)=\b{x}_i'\bs{\beta}\)
The following table gives the number of deaths, \(y_i\), in 8 groups of \(n_i\) beetles exposed to a dose \(x_i\) of \(CS_2\).
| \(i\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| \(y_i\) | 6 | 13 | 18 | 28 | 52 | 53 | 61 | 60 |
| \(n_i\) | 59 | 60 | 62 | 56 | 63 | 59 | 62 | 60 |
| \(x_i\)(mg) | 1.6907 | 1.7242 | 1.7552 | 1.7842 | 1.8113 | 1.8369 | 1.8610 | 1.8839 |
```r
beetles <- data.frame(
  y = c(6, 13, 18, 28, 52, 53, 61, 60),
  n = c(59, 60, 62, 56, 63, 59, 62, 60),
  x = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
)
model <- glm(cbind(y, n - y) ~ x, family = "binomial", data = beetles)
estimates <- coef(model)
estimates
```

```
(Intercept)           x
      -60.7        34.3
```
```r
summary(model)
```

```
Call:
glm(formula = cbind(y, n - y) ~ x, family = "binomial", data = beetles)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -60.72       5.18   -11.7   <2e-16 ***
x              34.27       2.91    11.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  11.232  on 6  degrees of freedom
AIC: 41.43

Number of Fisher Scoring iterations: 4
```
\(\leadsto \widehat{\text{LD}}_{50}\approx 1.77\) and \(\widehat{\text{LD}}_{90}\approx 1.84\)
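The lethal-dose estimates follow by inverting the fitted linear predictor: \(\widehat{\text{LD}}_{p}\) solves \(\text{logit}(p)=\hat\beta_0+\hat\beta_1 x\). A Python sketch using the rounded estimates from the summary (function name ours):

```python
import math

b0, b1 = -60.72, 34.27  # rounded estimates from the fitted model

def lethal_dose(p):
    # dose at which a proportion p of the beetles is expected to die
    return (math.log(p / (1 - p)) - b0) / b1

lethal_dose(0.5)  # ~1.77
lethal_dose(0.9)  # ~1.84
```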
The previous idea can be applied to other distributions. For example, for data resulting from counts, the Poisson distribution is frequently used.
\[Y_i \sim \text {Poi}(\lambda_i),\, \lambda_i\in \mathbb{R}^+,\]
independent for \(i=1,\ldots,n\).
In this case, the most usual link function is \(\log(\lambda_i)=\b{x}_i'\bs{\beta}\).
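With the log link, the coefficients have a multiplicative interpretation: a unit increase in \(x_i\) multiplies the mean count by \(e^{\beta_i}\). A small Python sketch with hypothetical coefficient values:

```python
import math

beta0, beta1 = 1.0, 0.3  # hypothetical coefficients

def mean_count(x):
    # inverse of the log link: lambda = exp(x' beta)
    return math.exp(beta0 + beta1 * x)

# a unit increase in x multiplies the mean by exp(beta1)
assert abs(mean_count(2.0) / mean_count(1.0) - math.exp(beta1)) < 1e-12
```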
Goodness-of-fit
The deviance of a model \(M\) is defined as
\[-2\log \frac{\mathcal{L}\left(\hat{\bs{\beta}}_M\right)}{\mathcal{L}\left(\hat{\bs{\beta}}_S\right)}= 2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\right)\]
where \(S\) represents the saturated model (one parameter per observation).
In R we have:
Null deviance: \(2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_0\right)\right)\)
(\(df=n-1\))
Residual deviance: \(2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\right)\)
(\(df=n-p\))
The difference can be taken as a measure of the explanatory capability of a model \(M\) relative to the null model.
The measure
\[\text{AIC} = 2p - 2\log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\]
is called the Akaike Information Criterion.
Just like the deviance, smaller values of this measure indicate a better fit.
Unlike the deviance, the AIC penalizes increases in model size.
Comparison of Nested Models
To test the hypothesis
\(H_0: \text{Reduced Model (R)}\) against \(H_1: \text{Larger Model (L)}\)
one can use the test statistic:
\[G^2 = -2\log \frac{\mathcal{L}\left(\hat{\bs{\beta}}_R\right)}{\mathcal{L}\left(\hat{\bs{\beta}}_L\right)}\underset{H_0}{\stackrel{a}{\sim}}\chi^2_{(p_L - p_R)}\]
\(H_0\) is rejected at level \(\alpha\) if the observed value \(G^2_o\) exceeds \(F^{-1}_{\chi^2_{\left(p_L - p_R\right)}}(1-\alpha)\), the \((1-\alpha)\) quantile of the \(\chi^2_{\left(p_L - p_R\right)}\) distribution.
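For the beetles data, the null and dose models are nested with \(p_L - p_R = 1\), so \(G^2\) is just the drop in deviance; and since a \(\chi^2_{(1)}\) variable is the square of a standard normal, the p-value can be sketched in Python with the normal CDF alone:

```python
import math
from statistics import NormalDist

# drop in deviance from the R output: null minus residual
G2 = 284.202 - 11.232

# for 1 degree of freedom, P(chi2 > g) = 2 * (1 - Phi(sqrt(g)))
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(G2)))

# G2 is huge, so the dose term is overwhelmingly significant
```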