\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)
To select terms in linear models, we have considered:

- the best subsets method
  - does not scale well to large \(p\)
- stepwise regression
  - can inflate coefficient estimates, deflate standard errors, and lead to arbitrary variable selection under multicollinearity
Penalized regression can provide estimation shrinkage and variable selection in a single step.
In general, penalized regression can be formulated as the following maximization problem
\[\hat{\bs{\beta}} = \underset{\bs{\beta}}{\arg\max}\left\{ \mathcal{L}(\bs{\beta})-\lambda P(\bs{\beta})\right\}\]
where \(P\) is a penalty function and \(\lambda \geq 0\) is a regularization or tuning parameter.
Ridge regression (\(L_2\) regularization)
\[P(\bs{\beta})=\sum_{i=1}^{p-1}\beta^2_i\]
\(\implies \hat{\beta}_i=\hat{\beta}_i^{OLS}\frac{1}{1 + n\lambda}\) (for orthonormal terms)
Developed to handle multicollinearity in LMs, it provides a uniform shrinking of the estimates.
LASSO regression (\(L_1\) regularization)
\[P(\bs{\beta})=\sum_{i=1}^{p-1}|\beta_i|\]
\(\implies \hat{\beta}_i=\hat{\beta}_i^{OLS}\max\left\{0,1-\frac{n\lambda}{\left|\hat{\beta}_i^{OLS}\right|}\right\}\) (for orthonormal terms)
Non-uniform shrinkage and term selection.
Elastic net regression
\[P(\bs{\beta})=(1-\alpha)\sum_{i=1}^{p-1}\beta^2_i + \alpha\sum_{i=1}^{p-1}|\beta_i|,\]
with \(\alpha\in [0,1]\).
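For orthonormal terms, the ridge and LASSO penalties act on each OLS estimate through the simple closed-form maps above. A minimal Python sketch (the helper names are ours, and \(t=n\lambda\) collects the tuning constants):

```python
def ridge_shrink(b_ols, t):
    # ridge: every coefficient is shrunk by the same factor 1/(1 + t)
    return b_ols / (1 + t)

def lasso_shrink(b_ols, t):
    # LASSO: soft thresholding; coefficients smaller than t are set exactly to zero
    return b_ols * max(0.0, 1 - t / abs(b_ols))

ridge_shrink(3.0, 1.0)   # 1.5: shrunk, but never exactly zero
lasso_shrink(3.0, 1.0)   # 2.0: shrunk by a constant amount
lasso_shrink(0.5, 1.0)   # 0.0: the term is dropped from the model
```

This is why ridge alone never performs variable selection, while the LASSO does.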
Notes
Requires standardized covariates (correlation transformation) \(\leadsto\) quantitative covariates only!
These ideas are not restricted to LMs.
R: glmnet package
Linear models were defined as
\[Y_i \sim \text {N}(\mu_i, \sigma^2),\ i=1,\ldots,n\]
with \(\mu_i=\b{x}'_i\bs{\beta}\).
For many observable random variables this Gaussian assumption is not reasonable.
Let us consider the observation of a binary response variable:
\[Y_i \sim \text {Ber}(\theta_i),\, \theta_i=P(Y_i=1)\in ]0,1[,\]
independent for \(i=1,\ldots,n\).
Now we cannot directly follow the previous modelling strategy with \[E[Y_i\mid \b{x}_i]=\theta_i=\b{x}_i'\bs{\beta}\]
Why? Because \(\b{x}_i'\bs{\beta}\) ranges over the whole real line, while \(\theta_i\) must lie in \(]0,1[\).
Instead, we can try something like \[g(\theta_i)=\b{x}_i'\bs{\beta}\]
where \(g: ]0,1[ \rightarrow \mathbb{R}\) is called a link function.
A possible choice is the logit function:
\[\mathop{\mathrm{\text{logit}}}(\theta)=\log \left(\frac{\theta}{1-\theta}\right)\]
The inverse is the logistic function:
\[g^{-1}(x)=\frac{1}{1 + e^{-x}}=\frac{e^x}{e^x+1}\]
with \(x\in\mathbb{R}\).
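As a quick sanity check, the link and its inverse can be coded directly; a minimal Python sketch (function names ours):

```python
import math

def logit(theta):
    # log-odds: maps ]0,1[ onto the whole real line
    return math.log(theta / (1 - theta))

def logistic(x):
    # inverse link: maps the real line back into ]0,1[
    return 1 / (1 + math.exp(-x))

# the round trip recovers the original probability
theta = 0.3
assert abs(logistic(logit(theta)) - theta) < 1e-12
```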
The logistic regression model can be written as:
\[Y_i \sim \text {Ber}\left(g^{-1}(\b{x}_i'\bs{\beta})\right),\]
independent for \(i=1,\ldots,n\).
Notes
The interpretation of the regression coefficients now relates to the logarithm of the odds of success:

- \(\beta_0\): log-odds of success for \(\b{x}=\b{0}\)
- \(\beta_i\): change in the log-odds of success due to a unit increase of \(x_i\) while the remaining terms are held constant
Other inferences rely on approximations that require large samples, such as:
\[\frac{\hat{\beta}_i-\beta_i}{se(\hat{\beta}_i)}\stackrel{a}{\sim} N(0,1)\]
where \(se(\hat{\beta}_i)\) is computed from the Hessian matrix of the log-likelihood function.
Other functions can be used as the inverse link function; for example \(\Phi(x)\), the standard Gaussian CDF, leads to the probit regression model.
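The standard Gaussian CDF has no elementary closed form, but it can be written with the error function, so a probit inverse link can be sketched in Python without any statistics library (function name ours):

```python
import math

def Phi(x):
    # standard normal CDF, expressed via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# like the logistic function, Phi maps the real line into ]0,1[
Phi(0.0)  # 0.5
```

Both inverse links are sigmoidal; they differ most in the tails, so the two models give similar fits except for probabilities near 0 and 1.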
Generalization (repeated observations)
\[Y_i \sim \text {binomial}(n_i, \theta_i),\]
independent for \(i=1,\ldots,k\).
\(E\left[\frac{Y_i}{n_i}\right] = \theta_i\) and \(\mathop{\mathrm{\text{logit}}}(\theta_i)=\b{x}_i'\bs{\beta}\)
The following table gives the number of deaths, \(y_i\), in 8 groups of \(n_i\) beetles exposed to a dose \(x_i\) of \(CS_2\).
| \(i\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| \(y_i\) | 6 | 13 | 18 | 28 | 52 | 53 | 61 | 60 |
| \(n_i\) | 59 | 60 | 62 | 56 | 63 | 59 | 62 | 60 |
| \(x_i\)(mg) | 1.6907 | 1.7242 | 1.7552 | 1.7842 | 1.8113 | 1.8369 | 1.8610 | 1.8839 |
```r
beetles <- data.frame(
  y = c(6, 13, 18, 28, 52, 53, 61, 60),
  n = c(59, 60, 62, 56, 63, 59, 62, 60),
  x = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
)
model <- glm(cbind(y, n - y) ~ x, family = "binomial", data = beetles)
estimates <- coef(model)
estimates
```

```
(Intercept)           x
      -60.7        34.3
```
```r
summary(model)
```

```
Call:
glm(formula = cbind(y, n - y) ~ x, family = "binomial", data = beetles)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -60.72       5.18   -11.7   <2e-16 ***
x              34.27       2.91    11.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.202  on 7  degrees of freedom
Residual deviance:  11.232  on 6  degrees of freedom
AIC: 41.43

Number of Fisher Scoring iterations: 4
```
\(\leadsto \widehat{\text{LD}}_{50}\approx 1.77\) and \(\widehat{\text{LD}}_{90}\approx 1.84\)
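The lethal-dose estimates follow by inverting the fitted linear predictor: \(\widehat{\text{LD}}_{p}\) solves \(\text{logit}(p)=\hat\beta_0+\hat\beta_1 x\). A Python sketch using the rounded estimates from the summary (function name ours):

```python
import math

b0, b1 = -60.72, 34.27  # rounded estimates from the fitted model

def lethal_dose(p):
    # dose at which a proportion p of the beetles is expected to die
    return (math.log(p / (1 - p)) - b0) / b1

lethal_dose(0.5)  # ~1.77
lethal_dose(0.9)  # ~1.84
```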
The previous idea can be applied to other distributions. For example, for data resulting from counts, the Poisson distribution is frequently used.
\[Y_i \sim \text {Poi}(\lambda_i),\, \lambda_i\in \mathbb{R}^+,\]
independent for \(i=1,\ldots,n\).
In this case, the most usual link function is \(\log(\lambda_i)=\b{x}_i'\bs{\beta}\).
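With the log link, the coefficients have a multiplicative interpretation: a unit increase in \(x_i\) multiplies the mean count by \(e^{\beta_i}\). A small Python sketch with hypothetical coefficient values:

```python
import math

beta0, beta1 = 1.0, 0.3  # hypothetical coefficients

def mean_count(x):
    # inverse of the log link: lambda = exp(x' beta)
    return math.exp(beta0 + beta1 * x)

# a unit increase in x multiplies the mean by exp(beta1)
assert abs(mean_count(2.0) / mean_count(1.0) - math.exp(beta1)) < 1e-12
```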
Goodness-of-fit
The deviance of a model \(M\) is defined as
\[-2\log \frac{\mathcal{L}\left(\hat{\bs{\beta}}_M\right)}{\mathcal{L}\left(\hat{\bs{\beta}}_S\right)}= 2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\right)\]
where \(S\) represents the saturated model (one parameter per observation).
In R we have:
Null deviance: \(2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_0\right)\right)\)
(\(df=n-1\))
Residual deviance: \(2\left(\log \mathcal{L}\left(\hat{\bs{\beta}}_S\right)- \log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\right)\)
(\(df=n-p\))
The difference can be taken as a measure of the explanatory capability of a model \(M\) relative to the null model.
The measure
\[\text{AIC} = 2p - 2\log \mathcal{L}\left(\hat{\bs{\beta}}_M\right)\]
is called the Akaike Information Criterion.
Just like the deviance, smaller values of this measure indicate a better fit.
Unlike the deviance, the AIC penalizes increases in model size.
Comparison of Nested Models
To test the hypothesis
\(H_0: \text{Reduced Model (R)}\) against \(H_1: \text{Larger Model (L)}\)
one can use the test statistic:
\[G^2 = -2\log \frac{\mathcal{L}\left(\hat{\bs{\beta}}_R\right)}{\mathcal{L}\left(\hat{\bs{\beta}}_L\right)}\underset{H_0}{\stackrel{a}{\sim}}\chi^2_{(p_L - p_R)}\]
\(H_0\) is rejected at level \(\alpha\) if the observed value \(G^2_o\) exceeds \(F^{-1}_{\chi^2_{\left(p_L - p_R\right)}}(1-\alpha)\), the \((1-\alpha)\) quantile of the \(\chi^2_{\left(p_L - p_R\right)}\) distribution.
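For the beetles data, the null and dose models are nested with \(p_L - p_R = 1\), so \(G^2\) is just the drop in deviance; and since a \(\chi^2_{(1)}\) variable is the square of a standard normal, the p-value can be sketched in Python with the normal CDF alone:

```python
import math
from statistics import NormalDist

# drop in deviance from the R output: null minus residual
G2 = 284.202 - 11.232

# for 1 degree of freedom, P(chi2 > g) = 2 * (1 - Phi(sqrt(g)))
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(G2)))

# G2 is huge, so the dose term is overwhelmingly significant
```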