\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)
Potential problems
The error structure is not adequate
Variance not constant (heteroscedasticity)
Correlated errors
Normality assumption
Existence of discordant observations (outliers) or observations with an excessive influence on the fitted model
Large correlations between covariates (multicollinearity)
Many diagnostic tools are based on the residuals \(\b{e}=\b{y}-\hat{\b{y}}\):
Graphical tools (informal diagnostics)
Plot of residuals against \(\hat{\b{y}}\)
Plots of residuals against \(\b{x}_i\)
Histograms, box-plots, QQ-plots, ...
More formal diagnostics: numerical measures, tests of hypotheses, ...
Usual strategy
Start with some graphical tools and use tests just to clarify suspicious or unclear indications from the plots.
The residuals
Under the working assumptions, a plot of the residuals should look like a symmetrical cloud of points around zero with constant variation.
We have seen that
\[\b{e}=(\b{I} - \b{H})\b{y}\sim N_n(\b{0}, \sigma^2(\b{I} - \b{H}))\]
\(\implies e_i\sim N(0, \sigma^2(1-h_{ii}))\)
\(\therefore\) the residuals are correlated r.v.’s with different variances
Studentized Residuals
\[r_i=\frac{e_i}{se(e_i)}=\frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}},\ i=1,\ldots, n\]
are called the internally Studentized residuals.
Note: rstandard() in R
Let \(d_i=y_i-\hat{y}_{-i}\), where \(\hat{y}_{-i}\) is the fitted value for the \(i^{th}\) observation computed after removing that observation from the fit (\(d_i\) is called the deleted residual).
Fortunately, \(d_i=\frac{e_i}{1-h_{ii}}\) and so,
\[\begin{align*} r_i^* & = \frac{e_i}{\sqrt{MSE_{-i} (1-h_{ii})}}=e_i\sqrt{\frac{n-p-1}{SSE (1-h_{ii})-e_i^2}}\\[0.25cm] & = r_i \sqrt{\frac{n-p-1}{n-p- r_i^2}}\sim t_{(n-p-1)},\ i=1,\ldots, n \end{align*}\]
are called the externally Studentized residuals.
Note: rstudent() in R
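As a quick check of these definitions, the sketch below recomputes both kinds of Studentized residuals from the formulas above and compares them with rstandard() and rstudent(). The fitted model (dist ~ speed on the built-in cars data) is only an illustrative stand-in, not part of these notes.

```r
# Minimal sketch: Studentized residuals by hand vs. rstandard()/rstudent()
fit <- lm(dist ~ speed, data = cars)   # illustrative model

e <- residuals(fit)                    # ordinary residuals e_i
h <- hatvalues(fit)                    # leverages h_ii
s <- summary(fit)$sigma                # hat(sigma) = sqrt(MSE)
n <- nobs(fit); p <- length(coef(fit))

r_int <- e / (s * sqrt(1 - h))                          # internally Studentized
r_ext <- r_int * sqrt((n - p - 1) / (n - p - r_int^2))  # externally Studentized

all.equal(r_int, rstandard(fit))       # should be TRUE
all.equal(r_ext, rstudent(fit))        # should be TRUE
```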
In a residuals plot we should look for:
skewness (B3)
extreme values (C)
some clear dependence on any covariate (or on the fitted values), in addition to the previous points (A)
Problem A (linear predictor)
Diagnostics:
residuals plots
Lack-of-fit test (requires replicate observations at some covariate values)
Measures:
add new terms to the model
transformation of covariates
try other models
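As an illustration of these diagnostics and measures, the sketch below plots the Studentized residuals against the fitted values and then adds a new term to the linear predictor. The dataset and the quadratic term are hypothetical, chosen only to show the workflow.

```r
# Sketch: residual plot, then a new term added to the linear predictor
fit <- lm(dist ~ speed, data = cars)

plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0, lty = 2)                 # look for curvature or systematic trends

fit2 <- lm(dist ~ speed + I(speed^2), data = cars)  # add a quadratic term
anova(fit, fit2)                       # does the extra term improve the fit?
```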
Problem B1 (heteroscedasticity)
Diagnostics:
residuals plots
Breusch-Pagan test
Measures:
transformation of covariates
transformation of \(y\) (Box-Cox or Yeo-Johnson families of transformations)
change the error structure and use weighted least squares (WLS)
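A possible workflow for this problem is sketched below: the Breusch-Pagan test from the lmtest package, a Box-Cox profile from MASS, and a weighted fit. The model and the choice of weights are purely illustrative assumptions.

```r
# Sketch: diagnosing and treating heteroscedasticity
library(lmtest)   # bptest()
library(MASS)     # boxcox()

fit <- lm(dist ~ speed, data = cars)

bptest(fit)       # Breusch-Pagan test for non-constant variance
boxcox(fit)       # profile log-likelihood for a Box-Cox transformation of y

# WLS with illustrative weights (assumed inversely proportional to the covariate)
fit_wls <- lm(dist ~ speed, data = cars, weights = 1 / speed)
summary(fit_wls)
```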
Problem B2 (correlated errors)
Diagnostics:
residuals plots
Runs test for randomness, Durbin-Watson test
Measures:
work with the quasi-differenced data \(y'_t=y_t-\hat{\rho}\,y_{t-1}\), where \(\hat{\rho}\) is an estimate of the lag-1 autocorrelation of the errors (Cochrane-Orcutt procedure)
WLS or robust regression
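The sketch below runs the Durbin-Watson test and refits the model on quasi-differenced data. It assumes the rows are ordered in time and uses a crude lag-1 autocorrelation estimate from the residuals; both the data and the estimate are for illustration only.

```r
# Sketch: Durbin-Watson test and refitting with quasi-differenced data
library(lmtest)   # dwtest()

fit <- lm(dist ~ speed, data = cars)   # assumes observations are in time order
dwtest(fit)                            # Durbin-Watson test for serial correlation

e   <- residuals(fit)
rho <- cor(e[-1], e[-length(e)])       # rough estimate of the lag-1 autocorrelation

y <- cars$dist; x <- cars$speed
y_d <- y[-1] - rho * y[-length(y)]     # y'_t = y_t - rho_hat * y_{t-1}
x_d <- x[-1] - rho * x[-length(x)]
fit_d <- lm(y_d ~ x_d)                 # model refitted on the transformed data
```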
Problem B3 (non-normality)
Diagnostics:
residuals plots
histograms, box-plots, QQ-plots of residuals
Shapiro-Wilk, Filliben, Lilliefors, ... tests
Measures:
transformations
use models with other error distributions
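A minimal sketch of the normality diagnostics, applied to the externally Studentized residuals of an illustrative model:

```r
# Sketch: normality checks on the Studentized residuals
fit <- lm(dist ~ speed, data = cars)
r <- rstudent(fit)

hist(r)                        # histogram
boxplot(r)                     # box-plot
qqnorm(r); qqline(r)           # normal QQ-plot
shapiro.test(r)                # Shapiro-Wilk test
```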
Problem C (unusual observations)
Outliers and Leverage points
Influential observations – leverage \(\times\) residual
Remember that \(Var[e_i]=\sigma^2 (1- h_{ii})\).
\(\leadsto\) a large value of \(h_{ii}\) will make \(Var[e_i]\) small and, consequently, the fitted response surface will be pulled towards \((y_i, \b{x}_i)\)
We will call \(h_{ii}\) the leverage of the \(i^{th}\) observation.
Influence on fitted values
We call \(DFFIT_i=\hat{y}_i-\hat{y}_{-i}\) the difference in fitted values and use
\[DFFITS_i=\frac{DFFIT_i}{\hat{\sigma}_{-i}\sqrt{h_{ii}}}=r^*_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}\]
as a measure of the influence that observation \(i\) has on the fitted value \(\hat{y}_i\).
An observation can be considered influential if:
small sample size and \(|DFFITS_i| >1\)
large sample size and \(|DFFITS_i| >2\sqrt{\frac{p}{n}}\)
\(\leadsto\) We need to look at \(n\) \(DFFITS\)!
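A short sketch of how the \(n\) DFFITS values can be screened in R, using the cutoffs above (the fitted model is again only an illustration):

```r
# Sketch: screening observations by DFFITS
fit <- lm(dist ~ speed, data = cars)

d <- dffits(fit)
n <- nobs(fit); p <- length(coef(fit))
which(abs(d) > 2 * sqrt(p / n))   # large-sample cutoff; use |DFFITS_i| > 1 for small n
```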
Influence on the regression coefficients
A similar thing can be done for the differences in the estimated regression coefficients, which leads to
\[DFBETAS_{k,i}=\frac{\hat{\beta}_k-\hat{\beta}_{k,-i}}{\hat{\sigma}_{-i}\sqrt{c_{kk}}}\]
where \(c_{kk}\) is the \(k^{th}\) element of the main diagonal of \((\b{X}'\b{X})^{-1}\).
An observation can be considered influential if:
small sample size and \(|DFBETAS_{k,i}| >1\)
large sample size and \(|DFBETAS_{k,i}| >\frac{2}{\sqrt{n}}\)
\(\leadsto\) Now we have \(n\times p\) \(DFBETAS\)!
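A matching sketch for the \(n\times p\) DFBETAS values, again on an illustrative model:

```r
# Sketch: screening observations by DFBETAS
fit <- lm(dist ~ speed, data = cars)

db <- dfbetas(fit)                             # one column per regression coefficient
n  <- nobs(fit)
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)   # large-sample cutoff; use 1 for small n
```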
Overall influence
Cook’s distance is defined as
\[D_i=\frac{(\hat{\b{y}} - \hat{\b{y}}_{-i})'(\hat{\b{y}} - \hat{\b{y}}_{-i})}{p\, MSE}=\frac{1}{p}\, r_i^2\,\frac{h_{ii}}{1-h_{ii}}\]
where \(r_i\) is the internally Studentized residual, and it is used as an aggregate measure of influence.
Note: usually the values of \(D_i\) are compared with the median of the \(F_{(p,\, n-p)}\) distribution.
                                leverage   std. residual   Cook's D
High leverage, Low residual       0.4096          -0.163    0.00926
Low leverage, High residual       0.0478           3.616    0.32819
High leverage, High residual      0.4096          -3.916    5.31837
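The sketch below shows how Cook's distance and the combined influence report can be obtained in R; the fitted model is only an illustration, but the cutoff follows the rule given above.

```r
# Sketch: Cook's distance and the combined influence report
fit <- lm(dist ~ speed, data = cars)
n <- nobs(fit); p <- length(coef(fit))

D <- cooks.distance(fit)
which(D > qf(0.5, p, n - p))     # compare with the median of F(p, n-p)

influence.measures(fit)          # DFBETAS, DFFITS, Cook's D, hat values, ...
```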
Measures:
Further investigation: is there an explanation? Is it an error? Or a fraud?
Consider the removal with caution.
Problem D (multicollinearity)
The existence of strong correlations between covariates can lead to:
unstable estimation
inflation of the standard errors \(\implies\) unreliable inferences
Diagnostics:
check the correlation matrix of the covariates
check for small eigenvalues of \(\b{X}'\b{X}\)
the Variance Inflation Factors, \(VIF_k=\frac{1}{1-R_k^2}\), where \(R_k^2\) is the coefficient of determination when \(x_k\) is regressed on all other covariates.
In the absence of linear correlation \(VIF_k=1\).
The mean, \(\overline{VIF}\), is also used as a global measure of multicollinearity (see the sketch after the example below).
A small dataset already used in Chapter 2:
demand education income urbanization
1 167 11.2 31.9 42.2
2 174 10.6 13.2 48.6
3 161 10.6 28.7 42.6
4 162 10.4 26.1 39.0
5 141 9.3 30.1 34.7
6 175 10.8 8.5 44.5
7 164 10.7 24.3 39.1
8 174 10.0 18.6 40.1
9 186 12.0 20.4 45.9
Call:
lm(formula = demand ~ ., data = product)
Residuals:
1 2 3 4 5 6 7 8 9
0.876 -0.972 -1.492 0.762 -4.612 -5.462 -2.129 11.653 1.377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.014 36.191 1.66 0.158
education 10.718 4.530 2.37 0.064 .
income -0.751 0.395 -1.90 0.116
urbanization 0.240 1.012 0.24 0.822
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.29 on 5 degrees of freedom
Multiple R-squared: 0.845, Adjusted R-squared: 0.753
F-statistic: 9.11 on 3 and 5 DF, p-value: 0.0181
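For this model the VIFs can be computed directly from their definition. The sketch below rebuilds the data frame from the listing above and regresses each covariate on the others; car::vif(fit) would give the same values if the car package is available.

```r
# Sketch: VIFs for the demand model (data copied from the listing above)
product <- data.frame(
  demand       = c(167, 174, 161, 162, 141, 175, 164, 174, 186),
  education    = c(11.2, 10.6, 10.6, 10.4, 9.3, 10.8, 10.7, 10.0, 12.0),
  income       = c(31.9, 13.2, 28.7, 26.1, 30.1, 8.5, 24.3, 18.6, 20.4),
  urbanization = c(42.2, 48.6, 42.6, 39.0, 34.7, 44.5, 39.1, 40.1, 45.9)
)
fit <- lm(demand ~ ., data = product)

covs <- c("education", "income", "urbanization")
vifs <- sapply(covs, function(v) {
  r2 <- summary(lm(reformulate(setdiff(covs, v), response = v),
                   data = product))$r.squared
  1 / (1 - r2)                 # VIF_k = 1 / (1 - R_k^2)
})
vifs
mean(vifs)                     # mean VIF as a global measure
```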
Measures:
Consider the replacement or removal of covariates with caution.
Standardize the variables and use ridge regression.
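A sketch of the ridge-regression measure, using MASS::lm.ridge on the standardized product data frame built in the previous sketch; the grid of lambda values is an arbitrary illustration.

```r
# Sketch: ridge regression on standardized variables
library(MASS)    # lm.ridge(), select()

product_std <- as.data.frame(scale(product))   # standardize response and covariates

fit_ridge <- lm.ridge(demand ~ . - 1, data = product_std,
                      lambda = seq(0, 10, by = 0.1))
select(fit_ridge)    # HKB, LW and GCV suggestions for lambda
plot(fit_ridge)      # ridge traces of the coefficients
```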
To look for the best possible regression model we need:
criteria that limit the inclusion of unnecessary terms in the regression model
Common criteria
t-tests and F-tests
\(R^2_a \uparrow \iff MSE \downarrow\)
Mallows’ \(C_p\) \((|C_p - p| \downarrow)\)
Akaike Information Criterion and related criteria (\(AIC \downarrow\))
\(PRESS_p=\sum_{i=1}^n(y_i-\hat{y}_{-i})^2 \downarrow\)
Many others ...
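The sketch below computes a few of these criteria for a candidate model; note that PRESS can be obtained directly from the ordinary residuals and the leverages, since the deleted residuals are \(d_i=e_i/(1-h_{ii})\). The fitted model is illustrative.

```r
# Sketch: some model-selection criteria for a candidate model
fit <- lm(dist ~ speed, data = cars)

summary(fit)$adj.r.squared                              # R^2_a
AIC(fit)                                                # Akaike Information Criterion
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2) # PRESS = sum of squared deleted residuals
press
```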
Automatic search procedures
Best subsets algorithms – finding the “best” subsets of covariates without examining all possible subsets
R: leaps package
Stepwise regression – develops a sequence of regression models, at each step adding or removing a term. There are three variants: forward, backward and both.
R: step() or MASS::stepAIC()
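Both search procedures are sketched below on an illustrative dataset: regsubsets() from the leaps package for best subsets, and step() for a stepwise search guided by the AIC.

```r
# Sketch: best subsets and stepwise search
library(leaps)   # regsubsets()

subs <- regsubsets(mpg ~ ., data = mtcars, nvmax = ncol(mtcars) - 1)
summary(subs)$adjr2      # adjusted R^2 of the best model of each size
summary(subs)$cp         # Mallows' Cp of the best model of each size

full <- lm(mpg ~ ., data = mtcars)
best <- step(full, direction = "both", trace = 0)  # stepwise search, AIC criterion
summary(best)
```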
Notes
The final step of a regression analysis is usually the validation of the selected model.
Three basic forms of model validation:
Use the model to predict new data and evaluate its predictive power.
Compute the Mean Square Prediction Error
\[MSPE = \frac{\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}{m}\]
The MSPE should be of the same order of magnitude as the MSE.
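A minimal sketch of this form of validation, using an arbitrary training/validation split of an illustrative dataset:

```r
# Sketch: estimating the MSPE on a validation (holdout) sample
set.seed(1)
idx   <- sample(nrow(mtcars), 22)     # roughly 2/3 of the data for fitting
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = valid)

mspe <- mean((valid$mpg - pred)^2)
mspe                                  # compare with MSE = summary(fit)$sigma^2
```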