3 Building the regression model

\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)

3.1 Diagnostics and corrective measures

Potential problems

  1. The linear predictor is not adequate

  2. The error structure is not adequate

    1. Variance not constant (heteroscedasticity)

    2. Correlated errors

    3. Violation of the normality assumption

  3. Existence of discordant observations (outliers) or observations with an excessive influence on the fitted model

  4. Large correlations between covariates (multicollinearity)

Many diagnostic tools are based on the residuals \(\b{e}=\b{y}-\hat{\b{y}}\).

Usual strategy

Start with some graphical tools and use tests just to clarify suspicious or unclear indications from the plots.

The residuals

Under the working assumptions, a plot of the residuals should look like a symmetrical cloud of points around zero with constant variation.

We have seen that

\[\b{e}=(\b{I} - \b{H})\b{y}\sim N_n(\b{0}, \sigma^2(\b{I} - \b{H}))\]

\(\implies e_i\sim N(0, \sigma^2(1-h_{ii}))\)

\(\therefore\) the residuals are correlated r.v.’s with different variances

Studentized Residuals

\[r_i=\frac{e_i}{se(e_i)}=\frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}},\ i=1,\ldots, n\]

are called the internally Studentized residuals.

Note: rstandard() in R

Let \(d_i=y_i-\hat{y}_{-i}\) where \(\hat{y}_{-i}\) is the fitted value after the removal of the \(i^{th}\) observation (deleted residual).

Fortunately, \(d_i=\frac{e_i}{1-h_{ii}}\) and so,

\[\begin{align*} r_i^* & = \frac{e_i}{\sqrt{MSE_{-i} (1-h_{ii})}}=e_i\sqrt{\frac{n-p-1}{SSE (1-h_{ii})-e_i^2}}\\[0.25cm] & = r_i \sqrt{\frac{n-p-1}{n-p- r_i^2}}\sim t_{(n-p-1)},\ i=1,\ldots, n \end{align*}\]

are called the externally Studentized residuals.

Note: rstudent() in R
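Both quantities are available in R, and the identities above are easy to verify numerically. A minimal sketch, assuming a fitted lm object called model:

h <- hatvalues(model)                        # leverages h_ii
e <- resid(model)                            # ordinary residuals
s <- summary(model)$sigma                    # sqrt(MSE)
r <- e / (s * sqrt(1 - h))                   # internally Studentized residuals
all.equal(r, rstandard(model))               # should be TRUE
n <- nobs(model); p <- length(coef(model))
r_star <- r * sqrt((n - p - 1) / (n - p - r^2))  # externally Studentized residuals
all.equal(r_star, rstudent(model))           # should be TRUE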

In a residuals plot we should look for (see the sketch after this list):

  1. deviation from zero in some systematic way or some clear pattern (A, B1 and B2)

  2. skewness (B3)

  3. extreme values (B4)

  4. some clear dependency on any covariate regarding the previous points (A)
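A minimal plotting sketch, again assuming a fitted lm object called model:

plot(fitted(model), rstandard(model))   # patterns or funnel shapes (A, B1)
abline(h = 0, lty = 2)
plot(model, which = 1)                  # built-in residuals vs fitted plot
qqnorm(rstandard(model)); qqline(rstandard(model))  # skewness, extreme values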

Problem A (linear predictor)

Diagnostics: plots of the residuals against the fitted values and against each covariate (look for curvature or any systematic pattern).

Measures:

  1. add new terms to the model

  2. transformation of covariates

  3. try other models
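A sketch of the first two measures, assuming the fitted lm object model contains a hypothetical covariate x1:

m_quad <- update(model, . ~ . + I(x1^2))       # 1. add a quadratic term
m_log  <- update(model, . ~ . - x1 + log(x1))  # 2. transform the covariate
anova(model, m_quad)                           # F-test: does the new term help?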

Problem B1 (heteroscedasticity)

Diagnostics: plot of the residuals (or their absolute values) against the fitted values, where a funnel shape suggests heteroscedasticity; Breusch-Pagan test.

Measures:

  1. transformation of covariates

  2. transformation of \(y\) (Box-Cox or Yeo-Johnson families of transformations)

  3. change the error structure and use weighted least squares (WLS)
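A sketch of measures 2 and 3, assuming a fitted lm object model with a positive response:

MASS::boxcox(model)   # profile log-likelihood for the Box-Cox parameter lambda
# WLS with weights from a crude estimate of the variance function:
w <- 1 / fitted(lm(abs(resid(model)) ~ fitted(model)))^2
m_wls <- update(model, weights = w)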

Problem B2 (correlated errors)

Diagnostics: plot of the residuals against time or observation order; Durbin-Watson test.

Measures:

  1. work with the quasi-differences \(y'_t=y_t-\hat{\rho}y_{t-1}\), where \(\hat{\rho}\) is an estimate of the lag-1 autocorrelation (Cochrane-Orcutt procedure)

  2. WLS or robust regression
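A sketch, assuming the observations are ordered in time, that the lmtest package is available, and that a vector y holds the response:

lmtest::dwtest(model)                # Durbin-Watson test for lag-1 autocorrelation
e   <- resid(model)
rho <- cor(e[-1], e[-length(e)])     # crude estimate of the autocorrelation
y_qd <- y[-1] - rho * y[-length(y)]  # quasi-differenced response (do the same to the covariates)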

Problem B3 (non-normality)

Diagnostics: normal Q-Q plot of the Studentized residuals; Shapiro-Wilk test.

Measures:

  1. transformations

  2. use models that do not assume normal errors (e.g., robust regression)
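A sketch of the usual graphical check plus a formal test:

hist(rstandard(model), breaks = "FD")   # look for skewness and heavy tails
shapiro.test(rstandard(model))          # Shapiro-Wilk test (use with care)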

Problem C (unusual observations)

Outliers and Leverage points

Influential observations – leverage \(\times\) residual

Remember that \(Var[e_i]=\sigma^2 (1- h_{ii})\).

\(\leadsto\) a large value of \(h_{ii}\) will make \(Var[e_i]\) small and, consequently, the fitted response surface will be pulled towards \((y_i, \b{x}_i)\)

We will call \(h_{ii}\) the leverage of the \(i^{th}\) observation.
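In R the leverages are returned by hatvalues(); a common rule of thumb flags observations with \(h_{ii} > 2p/n\):

h <- hatvalues(model)
p <- length(coef(model)); n <- nobs(model)
which(h > 2 * p / n)   # candidate high-leverage points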

Influence on fitted values

We call \(DFFIT_i=\hat{y}_i-\hat{y}_{-i}\) the difference in fitted values and use

\[DFFITS_i=\frac{DFFIT_i}{\hat{\sigma}_{-i}\sqrt{h_{ii}}}=r^*_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}\]

as a measure of the influence that observation \(i\) has on the fitted value \(\hat{y}_i\).

An observation can be considered influential if \(|DFFITS_i| > 1\) (small to medium datasets) or \(|DFFITS_i| > 2\sqrt{p/n}\) (large datasets).

\(\leadsto\) We need to look at \(n\) \(DFFITS\)!

Note: dffits() in R
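A sketch of this screening rule for a fitted lm object model:

p <- length(coef(model)); n <- nobs(model)
which(abs(dffits(model)) > 2 * sqrt(p / n))   # candidate influential points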

Influence on the regression coefficients

A similar measure can be computed for the differences in the estimated regression coefficients, which leads to

\[DFBETAS_{k,i}=\frac{\hat{\beta}_k-\hat{\beta}_{k,-i}}{\hat{\sigma}_{-i}\sqrt{c_{kk}}}\]

where \(c_{kk}\) is the \(k^{th}\) element of the main diagonal of \((\b{X}'\b{X})^{-1}\).

An observation can be considered influential if \(|DFBETAS_{k,i}| > 1\) (small to medium datasets) or \(|DFBETAS_{k,i}| > 2/\sqrt{n}\) (large datasets).

\(\leadsto\) Now we have \(n\times p\) \(DFBETAS\)!

Note: dfbetas() in R
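The analogous screening for the coefficients:

db <- dfbetas(model)                                   # an n x p matrix
which(abs(db) > 2 / sqrt(nobs(model)), arr.ind = TRUE) # flag (observation, coefficient) pairs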

Overall influence

Cook's distance is defined as

\[D_i=\frac{(\hat{\b{y}} - \hat{\b{y}}_{-i})'(\hat{\b{y}} - \hat{\b{y}}_{-i})}{p\, MSE}=\frac{1}{p}\, r_i^2\, \frac{h_{ii}}{1-h_{ii}}\]

and is used as an aggregate measure of influence.

Note: usually the values of \(D_i\) are compared with the median of the \(F_{(p, n-p)}\) distribution.

An illustration with three contrasting observations:

                             leverage std.residual cooks.d
High leverage, Low residual    0.4096       -0.163 0.00926
Low leverage, High residual    0.0478        3.616 0.32819
High leverage, High residual   0.4096       -3.916 5.31837
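A sketch of the comparison with the \(F\) median, plus the built-in index plot:

p <- length(coef(model)); n <- nobs(model)
D <- cooks.distance(model)
which(D > qf(0.5, p, n - p))   # observations worth inspecting
plot(model, which = 4)         # index plot of Cook's distance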

Measures:

  1. Further investigation: is there an explanation? Is it an error, or even fraud?

  2. Consider the removal with caution.

Problem D (multicollinearity)

The existence of strong correlations between covariates can lead to unstable coefficient estimates with inflated variances, coefficients with counter-intuitive signs, and numerical problems in the inversion of \(\b{X}'\b{X}\).

Diagnostics: correlation matrix of the covariates, eigenvalues (condition number) of \(\b{X}'\b{X}\), and variance inflation factors (VIF), where values above 5 to 10 are usually considered problematic.

A small dataset already used in Chapter 2:

  demand education income urbanization
1    167      11.2   31.9         42.2
2    174      10.6   13.2         48.6
3    161      10.6   28.7         42.6
4    162      10.4   26.1         39.0
5    141       9.3   30.1         34.7
6    175      10.8    8.5         44.5
7    164      10.7   24.3         39.1
8    174      10.0   18.6         40.1
9    186      12.0   20.4         45.9
model <- lm(demand ~ ., data=product)
summary(model)

Call:
lm(formula = demand ~ ., data = product)

Residuals:
     1      2      3      4      5      6      7      8      9 
 0.876 -0.972 -1.492  0.762 -4.612 -5.462 -2.129 11.653  1.377 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)    60.014     36.191    1.66    0.158  
education      10.718      4.530    2.37    0.064 .
income         -0.751      0.395   -1.90    0.116  
urbanization    0.240      1.012    0.24    0.822  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.29 on 5 degrees of freedom
Multiple R-squared:  0.845, Adjusted R-squared:  0.753 
F-statistic: 9.11 on 3 and 5 DF,  p-value: 0.0181
X <- model.matrix(demand ~ ., data=product)
eigen(t(X) %*% X)$values
[1] 2.14e+04 5.59e+02 2.10e+00 3.01e-02
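The spread of these eigenvalues is often summarized by the condition number of \(\b{X}\) (about 843 here; values above roughly 30 are usually taken as a warning sign):

kappa(X, exact = TRUE)   # ratio of the largest to the smallest singular value of X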
car::vif(model)
   education       income urbanization 
        2.31         1.98         3.61 
reduced_model <- update(model, . ~ . - urbanization)
car::vif(reduced_model)
education    income 
     1.03      1.03 

Measures:

  1. Consider the replacement or removal of covariates with caution.

  2. Standardize the variables and use ridge regression.
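A minimal ridge sketch for measure 2, using MASS::lm.ridge (which centers and scales the covariates internally):

rr <- MASS::lm.ridge(demand ~ ., data = product, lambda = seq(0, 10, by = 0.1))
MASS::select(rr)   # HKB, LW and GCV choices of the ridge penalty lambda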

3.2 Selection of variables

To look for the best possible regression model we need:

  1. A set of candidate models

  2. An evaluation or comparison criterion

Limitations in the inclusion of terms in a regression model

  1. We should have \(n \gg p\), otherwise the risk of overfitting increases.

  2. Usually impossible to consider all \(2^{k}\) subsets of covariates.

Common criteria

  1. t-tests and F-tests

  2. \(R^2_a \uparrow \iff MSE \downarrow\)

  3. Mallows’ \(C_p\) \((|C_p - p| \downarrow)\)

  4. Akaike Information Criterion and related criteria (\(AIC \downarrow\))

  5. \(PRESS_p=\sum_{i=1}^n(y_i-\hat{y}_{-i})^2 \downarrow\) (see the sketch after this list)

  6. Many others . . .
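Note that PRESS does not require \(n\) refits: using the deleted-residual identity \(d_i=e_i/(1-h_{ii})\) from Section 3.1, a one-line sketch for a fitted lm object model is

PRESS <- sum((resid(model) / (1 - hatvalues(model)))^2)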

Automatic search procedures

  1. Best subsets algorithms – finding the “best” subsets of covariates without examining all possible subsets

    R: leaps package

  2. Stepwise regression – develops a sequence of regression models, at each step adding or removing a term. There are three variants: forward, backward and both.

    R: step() or MASS::stepAIC() (see the sketch after this list)
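A minimal sketch of both procedures on the product data from Section 3.1 (assuming the leaps package is available):

best <- leaps::regsubsets(demand ~ ., data = product)         # best subsets
summary(best)$adjr2                                           # adjusted R^2 per subset size
step(lm(demand ~ ., data = product), direction = "backward")  # stepwise search by AIC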

Notes

  1. Unlike the best subsets algorithms, stepwise regression outputs a single model.

  2. Stepwise regression faces heavy criticism: it tends to inflate the number of selected covariates, overestimate the regression coefficients, underestimate the standard errors, . . .

3.3 Model validation

The final step of a regression analysis is usually the validation of the selected model.

Three basic forms of model validation:

  1. Compare the results with theoretically expected, empirical or simulated results.

  2. Use the model to predict new data and evaluate its predictive power.

    Compute the Mean Square Prediction Error

    \[MSPE = \frac{\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}{m}\]

    The MSPE should be of the same order of magnitude as the MSE.

  3. Perform some type of cross-validation – for example, split the data into two subsets (train + test): the first one is used to build the model and the second one to validate it (see the sketch after this list).
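A minimal sketch of forms 2 and 3, assuming a data frame dat with response y (both names hypothetical):

set.seed(1)
idx   <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))  # 70% training split
train <- dat[idx, ]; test <- dat[-idx, ]
m     <- lm(y ~ ., data = train)
MSPE  <- mean((test$y - predict(m, newdata = test))^2)     # mean square prediction error
c(MSPE = MSPE, MSE = summary(m)$sigma^2)                   # should be of similar magnitude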