3 Building the regression model

\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)

3.1 Diagnostics and corrective measures

Potential problems

  1. The linear predictor is not adequate

  2. The error structure is not adequate

    1. Variance not constant (heteroscedasticity)

    2. Correlated errors

    3. Violation of the normality assumption

  3. Existence of discordant observations (outliers) or observations with an excessive influence on the fitted model

  4. Large correlations between covariates (multicollinearity)

Many diagnostic tools are based on the residuals \(\b{e}=\b{y}-\hat{\b{y}}\).

Usual strategy

Start with some graphical tools and use tests just to clarify suspicious or unclear indications from the plots.

The residuals

Under the working assumptions, a plot of the residuals should look like a symmetrical cloud of points around zero with constant variation.

We have seen that

\[\b{e}=(\b{I} - \b{H})\b{y}\sim N_n(\b{0}, \sigma^2(\b{I} - \b{H}))\]

\(\implies e_i\sim N(0, \sigma^2(1-h_{ii}))\)

\(\therefore\) the residuals are correlated r.v.’s with different variances

Studentized Residuals

\[r_i=\frac{e_i}{se(e_i)}=\frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}},\ i=1,\ldots, n\]

are called the internally Studentized residuals.

Note: rstandard() in R

Let \(d_i=y_i-\hat{y}_{-i}\) where \(\hat{y}_{-i}\) is the fitted value after the removal of the \(i^{th}\) observation (deleted residual).

Fortunately, \(d_i=\frac{e_i}{1-h_{ii}}\) and so,

\[\begin{align*} r_i^* & = \frac{e_i}{\sqrt{MSE_{-i} (1-h_{ii})}}=e_i\sqrt{\frac{n-p-1}{SSE (1-h_{ii})-e_i^2}}\\[0.25cm] & = r_i \sqrt{\frac{n-p-1}{n-p- r_i^2}}\sim t_{(n-p-1)},\ i=1,\ldots, n \end{align*}\]

are called the externally Studentized residuals.

Note: rstudent() in R
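Both quantities are available in R, and the identities above are easy to verify numerically. A minimal sketch, assuming a fitted lm object called model:

h <- hatvalues(model)                        # leverages h_ii
e <- resid(model)                            # ordinary residuals
s <- summary(model)$sigma                    # sqrt(MSE)
r <- e / (s * sqrt(1 - h))                   # internally Studentized residuals
all.equal(r, rstandard(model))               # should be TRUE
n <- nobs(model); p <- length(coef(model))
r_star <- r * sqrt((n - p - 1) / (n - p - r^2))  # externally Studentized residuals
all.equal(r_star, rstudent(model))           # should be TRUE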

In a residuals plot we should look for (see the sketch after this list):

  1. deviation from zero in some systematic way or some clear pattern (A, B1 and B2)

  2. skewness (B3)

  3. extreme values (B4)

  4. some clear dependency on any covariate regarding the previous points (A)
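A minimal plotting sketch, again assuming a fitted lm object called model:

plot(fitted(model), rstandard(model))   # patterns or funnel shapes (A, B1)
abline(h = 0, lty = 2)
plot(model, which = 1)                  # built-in residuals vs fitted plot
qqnorm(rstandard(model)); qqline(rstandard(model))  # skewness, extreme values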

Problem A (linear predictor)

Diagnostics: plots of the residuals against the fitted values and against each covariate (look for curvature or any systematic pattern).

Measures:

  1. add new terms to the model

  2. transformation of covariates

  3. try other models
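A sketch of the first two measures, assuming the fitted lm object model contains a hypothetical covariate x1:

m_quad <- update(model, . ~ . + I(x1^2))       # 1. add a quadratic term
m_log  <- update(model, . ~ . - x1 + log(x1))  # 2. transform the covariate
anova(model, m_quad)                           # F-test: does the new term help?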

Problem B1 (heteroscedasticity)

Diagnostics: plot of the residuals (or their absolute values) against the fitted values, where a funnel shape suggests heteroscedasticity; Breusch-Pagan test.

Measures:

  1. transformation of covariates

  2. transformation of \(y\) (Box-Cox or Yeo-Johnson families of transformations)

  3. change the error structure and use weighted least squares (WLS)
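A sketch of measures 2 and 3, assuming a fitted lm object model with a positive response:

MASS::boxcox(model)   # profile log-likelihood for the Box-Cox parameter lambda
# WLS with weights from a crude estimate of the variance function:
w <- 1 / fitted(lm(abs(resid(model)) ~ fitted(model)))^2
m_wls <- update(model, weights = w)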

Problem B2 (correlated errors)

Diagnostics: plot of the residuals against time or observation order; Durbin-Watson test.

Measures:

  1. work with the quasi-differences \(y'_t=y_t-\hat{\rho}y_{t-1}\), where \(\hat{\rho}\) is an estimate of the lag-1 autocorrelation (Cochrane-Orcutt procedure)

  2. WLS or robust regression
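A sketch, assuming the observations are ordered in time, that the lmtest package is available, and that a vector y holds the response:

lmtest::dwtest(model)                # Durbin-Watson test for lag-1 autocorrelation
e   <- resid(model)
rho <- cor(e[-1], e[-length(e)])     # crude estimate of the autocorrelation
y_qd <- y[-1] - rho * y[-length(y)]  # quasi-differenced response (do the same to the covariates)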

Problem B3 (non-normality)

Diagnostics: normal Q-Q plot of the Studentized residuals; Shapiro-Wilk test.

Measures:

  1. transformations

  2. use models that do not assume normal errors (e.g., robust regression)
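A sketch of the usual graphical check plus a formal test:

hist(rstandard(model), breaks = "FD")   # look for skewness and heavy tails
shapiro.test(rstandard(model))          # Shapiro-Wilk test (use with care)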

Problem C (unusual observations)

Outliers and Leverage points

Influential observations – leverage \(\times\) residual

Remember that \(Var[e_i]=\sigma^2 (1- h_{ii})\).

\(\leadsto\) a large value of \(h_{ii}\) will make \(Var[e_i]\) small and, consequently, the fitted response surface will be pulled towards \((y_i, \b{x}_i)\)

We will call \(h_{ii}\) the leverage of the \(i^{th}\) observation.
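In R the leverages are returned by hatvalues(); a common rule of thumb flags observations with \(h_{ii} > 2p/n\):

h <- hatvalues(model)
p <- length(coef(model)); n <- nobs(model)
which(h > 2 * p / n)   # candidate high-leverage points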

Influence on fitted values

We call \(DFFIT_i=\hat{y}_i-\hat{y}_{-i}\) the difference in fitted values and use

\[DFFITS_i=\frac{DFFIT_i}{\hat{\sigma}_{-i}\sqrt{h_{ii}}}=r^*_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}\]

as a measure of the influence that observation \(i\) has on the fitted value \(\hat{y}_i\).

An observation can be considered influential if \(|DFFITS_i| > 1\) (small to medium datasets) or \(|DFFITS_i| > 2\sqrt{p/n}\) (large datasets).

\(\leadsto\) We need to look at \(n\) \(DFFITS\)!

Note: dffits() in R
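A sketch of this screening rule for a fitted lm object model:

p <- length(coef(model)); n <- nobs(model)
which(abs(dffits(model)) > 2 * sqrt(p / n))   # candidate influential points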

Influence on the regression coefficients

A similar measure can be computed for the differences in the estimated regression coefficients, which leads to

\[DFBETAS_{k,i}=\frac{\hat{\beta}_k-\hat{\beta}_{k,-i}}{\hat{\sigma}_{-i}\sqrt{c_{kk}}}\]

where \(c_{kk}\) is the \(k^{th}\) element of the main diagonal of \((\b{X}'\b{X})^{-1}\).

An observation can be considered influential if \(|DFBETAS_{k,i}| > 1\) (small to medium datasets) or \(|DFBETAS_{k,i}| > 2/\sqrt{n}\) (large datasets).

\(\leadsto\) Now we have \(n\times p\) \(DFBETAS\)!

Note: dfbetas() in R
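The analogous screening for the coefficients:

db <- dfbetas(model)                                   # an n x p matrix
which(abs(db) > 2 / sqrt(nobs(model)), arr.ind = TRUE) # flag (observation, coefficient) pairs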

Overall influence

Cook's distance is defined as

\[D_i=\frac{(\hat{\b{y}} - \hat{\b{y}}_{-i})'(\hat{\b{y}} - \hat{\b{y}}_{-i})}{p\, MSE}=\frac{1}{p}\, r_i^2\, \frac{h_{ii}}{1-h_{ii}}\]

and is used as an aggregate measure of influence.

Note: usually the values of \(D_i\) are compared with the median of the \(F_{(p, n-p)}\) distribution.

An illustration with three contrasting observations:

                             leverage std.residual cooks.d
High leverage, Low residual    0.4096       -0.163 0.00926
Low leverage, High residual    0.0478        3.616 0.32819
High leverage, High residual   0.4096       -3.916 5.31837
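A sketch of the comparison with the \(F\) median, plus the built-in index plot:

p <- length(coef(model)); n <- nobs(model)
D <- cooks.distance(model)
which(D > qf(0.5, p, n - p))   # observations worth inspecting
plot(model, which = 4)         # index plot of Cook's distance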

Measures:

  1. Further investigation: is there an explanation? Is it an error, or even fraud?

  2. Consider the removal with caution.

Problem D (multicollinearity)

The existence of strong correlations between covariates can lead to unstable coefficient estimates with inflated variances, coefficients with counter-intuitive signs, and numerical problems in the inversion of \(\b{X}'\b{X}\).

Diagnostics: correlation matrix of the covariates, eigenvalues (condition number) of \(\b{X}'\b{X}\), and variance inflation factors (VIF), where values above 5 to 10 are usually considered problematic.

A small dataset already used in Chapter 2:

  demand education income urbanization
1    167      11.2   31.9         42.2
2    174      10.6   13.2         48.6
3    161      10.6   28.7         42.6
4    162      10.4   26.1         39.0
5    141       9.3   30.1         34.7
6    175      10.8    8.5         44.5
7    164      10.7   24.3         39.1
8    174      10.0   18.6         40.1
9    186      12.0   20.4         45.9
model <- lm(demand ~ ., data=product)
summary(model)

Call:
lm(formula = demand ~ ., data = product)

Residuals:
     1      2      3      4      5      6      7      8      9 
 0.876 -0.972 -1.492  0.762 -4.612 -5.462 -2.129 11.653  1.377 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)    60.014     36.191    1.66    0.158  
education      10.718      4.530    2.37    0.064 .
income         -0.751      0.395   -1.90    0.116  
urbanization    0.240      1.012    0.24    0.822  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.29 on 5 degrees of freedom
Multiple R-squared:  0.845, Adjusted R-squared:  0.753 
F-statistic: 9.11 on 3 and 5 DF,  p-value: 0.0181
X <- model.matrix(demand ~ ., data=product)
eigen(t(X) %*% X)$values
[1] 2.14e+04 5.59e+02 2.10e+00 3.01e-02
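The spread of these eigenvalues is often summarized by the condition number of \(\b{X}\) (about 843 here; values above roughly 30 are usually taken as a warning sign):

kappa(X, exact = TRUE)   # ratio of the largest to the smallest singular value of X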
car::vif(model)
   education       income urbanization 
        2.31         1.98         3.61 
reduced_model <- update(model, . ~ . - urbanization)
car::vif(reduced_model)
education    income 
     1.03      1.03 

Measures:

  1. Consider the replacement or removal of covariates with caution.

  2. Standardize the variables and use ridge regression.
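A minimal ridge sketch for measure 2, using MASS::lm.ridge (which centers and scales the covariates internally):

rr <- MASS::lm.ridge(demand ~ ., data = product, lambda = seq(0, 10, by = 0.1))
MASS::select(rr)   # HKB, LW and GCV choices of the ridge penalty lambda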

3.2 Selection of variables

To look for the best possible regression model we need:

  1. A set of candidate models

  2. An evaluation or comparison criterion

Limitations in the inclusion of terms in a regression model

  1. We should have \(n \gg p\), otherwise the risk of overfitting increases.

  2. Usually impossible to consider all \(2^{k}\) subsets of covariates.

Common criteria

  1. t-tests and F-tests

  2. \(R^2_a \uparrow \iff MSE \downarrow\)

  3. Mallows’ \(C_p\) \((|C_p - p| \downarrow)\)

  4. Akaike Information Criterion and related criteria (\(AIC \downarrow\))

  5. \(PRESS_p=\sum_{i=1}^n(y_i-\hat{y}_{-i})^2 \downarrow\) (see the sketch after this list)

  6. Many others . . .
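Note that PRESS does not require \(n\) refits: using the deleted-residual identity \(d_i=e_i/(1-h_{ii})\) from Section 3.1, a one-line sketch for a fitted lm object model is

PRESS <- sum((resid(model) / (1 - hatvalues(model)))^2)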

Automatic search procedures

  1. Best subsets algorithms – finding the “best” subsets of covariates without examining all possible subsets

    R: leaps package

  2. Stepwise regression – develops a sequence of regression models, at each step adding or removing a term. There are three variants: forward, backward and both.

    R: step() or MASS::stepAIC() (see the sketch after this list)
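A minimal sketch of both procedures on the product data from Section 3.1 (assuming the leaps package is available):

best <- leaps::regsubsets(demand ~ ., data = product)         # best subsets
summary(best)$adjr2                                           # adjusted R^2 per subset size
step(lm(demand ~ ., data = product), direction = "backward")  # stepwise search by AIC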

Notes

  1. Unlike the best subsets algorithms, stepwise regression outputs a single model.

  2. Stepwise regression faces heavy criticism: it tends to inflate the number of selected covariates, overestimate the regression coefficients, underestimate the standard errors, . . .

3.3 Model validation

The final step of a regression analysis is usually the validation of the selected model.

Three basic forms of model validation:

  1. Compare the results with theoretically expected, empirical or simulated results.

  2. Use the model to predict new data and evaluate its predictive power.

    Compute the Mean Square Prediction Error

    \[MSPE = \frac{\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}{m}\]

    The MSPE should be of the same order of magnitude as the MSE.

  3. Perform some type of cross-validation – for example, split the data into two subsets (train + test): the first one is used to build the model and the second one to validate it (see the sketch after this list).
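A minimal sketch of forms 2 and 3, assuming a data frame dat with response y (both names hypothetical):

set.seed(1)
idx   <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))  # 70% training split
train <- dat[idx, ]; test <- dat[-idx, ]
m     <- lm(y ~ ., data = train)
MSPE  <- mean((test$y - predict(m, newdata = test))^2)     # mean square prediction error
c(MSPE = MSPE, MSE = summary(m)$sigma^2)                   # should be of similar magnitude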