Linear Models
Lab#4 – Diagnostics and model selection
1 The diabetes dataset that just won’t go away . . .
We will keep using the dataset introduced in Lab#1 where we had this brief description:
Ten baseline variables, AGE (in years), SEX, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6) were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (Y). The data is available in the file diabetes.txt.
url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
db <- read.delim(url)
1.1 Train & test
Start by splitting the data into a training set and a testing set. The latter should contain 20% of the observations, randomly sampled in a reproducible way. The training set will be used to conduct model selection and diagnostics and the test set will be used to compare the predictive power of a few candidate models.
set.seed(3185) # for reproducible example
n <- nrow(db)
test.index <- sample(n, 0.2 * n)
db.test <- db[test.index,]
db.train <- db[-test.index,]
1.2 Diagnostics and manual variable selection
Fit the first-order model with all covariates and use the results from the t-tests to select a reduced model. Consider the inclusion of first-order interactions and apply the more common diagnostic techniques to the last selected model.
mod1 <- lm(Y ~ ., data = db.train)
summary(mod1)
Call:
lm(formula = Y ~ ., data = db.train)
Residuals:
Min 1Q Median 3Q Max
-161.43 -37.86 -0.64 36.88 149.37
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -366.8566 77.1556 -4.75 2.9e-06 ***
AGE -0.0626 0.2449 -0.26 0.79855
SEX -25.1110 6.5398 -3.84 0.00015 ***
BMI 5.7989 0.8052 7.20 3.8e-12 ***
BP 1.1058 0.2570 4.30 2.2e-05 ***
S1 -1.4509 0.6839 -2.12 0.03458 *
S2 1.0761 0.6343 1.70 0.09070 .
S3 0.7231 0.8860 0.82 0.41498
S4 7.4441 6.5200 1.14 0.25436
S5 79.4815 18.6547 4.26 2.6e-05 ***
S6 0.1823 0.3061 0.60 0.55192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.7 on 343 degrees of freedom
Multiple R-squared: 0.521, Adjusted R-squared: 0.507
F-statistic: 37.3 on 10 and 343 DF, p-value: <2e-16
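Removing the covariates with the largest p-values (AGE, S3, S4, S6) yields a reduced model: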
mod2 <- lm(Y ~ SEX + BMI + BP + S1 + S2 + S5, data = db.train)
summary(mod2)
Call:
lm(formula = Y ~ SEX + BMI + BP + S1 + S2 + S5, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-162.98 -38.44 -1.04 36.40 147.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -319.246 28.693 -11.13 < 2e-16 ***
SEX -24.485 6.439 -3.80 0.00017 ***
BMI 5.848 0.792 7.39 1.1e-12 ***
BP 1.089 0.247 4.41 1.4e-05 ***
S1 -1.146 0.249 -4.60 6.0e-06 ***
S2 0.936 0.255 3.67 0.00028 ***
S5 77.393 8.482 9.12 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.6 on 347 degrees of freedom
Multiple R-squared: 0.519, Adjusted R-squared: 0.51
F-statistic: 62.4 on 6 and 347 DF, p-value: <2e-16
mod3 <- lm(Y ~ (SEX + BMI + BP + S1 + S2 + S5)^2, data = db.train)
summary(mod3)
Call:
lm(formula = Y ~ (SEX + BMI + BP + S1 + S2 + S5)^2, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-165.61 -35.84 -1.95 30.19 169.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.32e+02 2.54e+02 1.30 0.193
SEX -1.25e+02 6.38e+01 -1.96 0.051 .
BMI -5.92e+00 7.93e+00 -0.75 0.456
BP -4.51e+00 2.69e+00 -1.68 0.095 .
S1 -1.05e+00 2.20e+00 -0.48 0.634
S2 3.48e-01 2.28e+00 0.15 0.879
S5 1.83e+01 8.34e+01 0.22 0.827
SEX:BMI 1.32e+00 1.80e+00 0.73 0.465
SEX:BP 7.28e-01 5.40e-01 1.35 0.179
SEX:S1 7.52e-01 5.34e-01 1.41 0.160
SEX:S2 -8.91e-01 5.43e-01 -1.64 0.102
SEX:S5 -9.37e+00 1.79e+01 -0.52 0.602
BMI:BP 1.21e-01 5.44e-02 2.22 0.027 *
BMI:S1 -1.54e-01 6.77e-02 -2.28 0.023 *
BMI:S2 1.18e-01 6.82e-02 1.73 0.085 .
BMI:S5 2.87e+00 2.08e+00 1.38 0.167
BP:S1 2.99e-02 1.92e-02 1.56 0.120
BP:S2 -2.09e-02 1.99e-02 -1.05 0.296
BP:S5 -4.07e-01 6.88e-01 -0.59 0.554
S1:S2 -2.45e-03 2.45e-03 -1.00 0.317
S1:S5 2.00e-02 3.84e-01 0.05 0.958
S2:S5 3.25e-01 4.47e-01 0.73 0.467
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.7 on 332 degrees of freedom
Multiple R-squared: 0.553, Adjusted R-squared: 0.525
F-statistic: 19.6 on 21 and 332 DF, p-value: <2e-16
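Only the BMI:BP and BMI:S1 interactions are significant at the 5% level, so we add just those two to the reduced model: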
mod4 <- lm(Y ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data = db.train)
summary(mod4)
Call:
lm(formula = Y ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-157.7 -36.6 -2.0 34.0 153.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.8635 140.3309 -0.19 0.8483
SEX -25.4142 6.3804 -3.98 8.3e-05 ***
BMI -5.3278 5.3157 -1.00 0.3169
BP -2.8081 1.3057 -2.15 0.0322 *
S1 -0.8468 0.6058 -1.40 0.1631
S2 1.0425 0.2547 4.09 5.3e-05 ***
S5 79.4039 8.4331 9.42 < 2e-16 ***
BMI:BP 0.1438 0.0473 3.04 0.0025 **
BMI:S1 -0.0136 0.0212 -0.64 0.5232
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54 on 345 degrees of freedom
Multiple R-squared: 0.531, Adjusted R-squared: 0.52
F-statistic: 48.9 on 8 and 345 DF, p-value: <2e-16
Homoscedasticity
Using the Breusch-Pagan test from the lmtest package:
lmtest::bptest(mod4)
studentized Breusch-Pagan test
data: mod4
BP = 38, df = 8, p-value = 7e-06
The homoscedasticity null hypothesis is rejected, so the residual variance may depend on the covariates. However, the plots below do not reveal any clear association:
res <- residuals(mod4) # residuals of the current model
par(mfrow = c(3, 3))
vars <- names(db.train)
vars <- vars[!(vars %in% c("SEX", "Y"))]
for (v in vars) plot(db.train[, v], res, xlab = v)
par(mfrow = c(1, 1))
Influential observations
summary(cooks.distance(mod4))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000000 0.000206 0.000950 0.002793 0.003069 0.048550
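All distances are far below 1, the usual alarm level. To list the observations above the common 4/n rule of thumb, a minimal sketch (the cutoff choice is a convention, not part of the lab statement):
# Cook's distances above the 4/n rule of thumb
cd <- cooks.distance(mod4)
which(cd > 4 / nrow(db.train))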
plot(mod4, which = 4)
The following section is just an illustration of some of the many additional procedures that can be used as diagnostics or corrective measures for a linear model.
Linearity
lmtest::harvtest(mod4)
Harvey-Collier test
data: mod4
HC = 0.56, df = 344, p-value = 0.6
lmtest::raintest(mod4)
Rainbow test
data: mod4
Rain = 0.92, df1 = 177, df2 = 168, p-value = 0.7
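Neither test rejects linearity. A residuals-versus-fitted plot is the usual graphical complement (a sketch; any systematic curvature would suggest a missing non-linear term):
# Residuals vs fitted values for mod4
plot(mod4, which = 1)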
Correlation
lmtest::dwtest(mod4)
Durbin-Watson test
data: mod4
DW = 2.1, p-value = 0.9
alternative hypothesis: true autocorrelation is greater than 0
lmtest::bgtest(mod4)
Breusch-Godfrey test for serial correlation of order up to 1
data: mod4
LM test = 1.3, df = 1, p-value = 0.3
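Neither test suggests serially correlated errors. The sample autocorrelation function of the residuals gives a quick visual confirmation (sketch):
# ACF of the residuals; spikes outside the dashed bands would indicate correlation
acf(residuals(mod4))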
Normality
shapiro.test(rstudent(mod4))
Shapiro-Wilk normality test
data: rstudent(mod4)
W = 1, p-value = 0.3
ppcc::ppccTest(rstudent(mod4), qfn = "qnorm") # Filliben test
Probability Plot Correlation Coefficient Test
data: rstudent(mod4)
ppcc = 1, n = 354, p-value = 0.3
alternative hypothesis: rstudent(mod4) differs from a Normal distribution
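Both tests are compatible with normal errors. A normal QQ plot is the standard graphical check; a sketch using car's qqPlot, which plots the studentized residuals with a pointwise confidence envelope:
# QQ plot of studentized residuals with a confidence envelope
car::qqPlot(mod4)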
Box-Cox transformation
# Yt = (Y^lambda - 1) / lambda or log(Y) for lambda==0
bc <- car::boxCox(mod4)
lambda <- bc$x[which.max(bc$y)]
lambda
[1] 0.4242
mod5 <- lm(car::bcnPower(Y, lambda, gamma = 0) ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data=db.train)
summary(mod5)
Call:
lm(formula = car::bcnPower(Y, lambda, gamma = 0) ~ SEX + BMI +
BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-9.926 -2.003 0.061 2.072 7.581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.631794 7.975072 0.46 0.649
SEX -1.513567 0.362600 -4.17 3.8e-05 ***
BMI -0.211274 0.302092 -0.70 0.485
BP -0.119520 0.074206 -1.61 0.108
S1 -0.056203 0.034428 -1.63 0.103
S2 0.065851 0.014477 4.55 7.5e-06 ***
S5 4.719141 0.479258 9.85 < 2e-16 ***
BMI:BP 0.006654 0.002689 2.47 0.014 *
BMI:S1 -0.000657 0.001207 -0.54 0.587
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.07 on 345 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.514
F-statistic: 47.6 on 8 and 345 DF, p-value: <2e-16
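Note that mod5 produces predictions on the transformed scale; they can be mapped back by inverting the Box-Cox formula above (a sketch, valid here since lambda is non-zero and Y is positive):
# Back-transform predictions to the original scale of Y
pred.t <- predict(mod5, db.test)
pred.orig <- (lambda * pred.t + 1)^(1 / lambda)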
Homoscedasticity
lmtest::bptest(mod5)
studentized Breusch-Pagan test
data: mod5
BP = 23, df = 8, p-value = 0.003
Note: the Box-Cox transformation does not solve the heteroscedasticity problem!
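A common corrective measure in this situation is to keep the least-squares estimates but base inference on heteroscedasticity-consistent standard errors. A sketch, assuming the sandwich package is installed:
# t-tests for mod4 with HC3 robust standard errors
lmtest::coeftest(mod4, vcov = sandwich::vcovHC(mod4, type = "HC3"))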
1.3 Automatic model search
Use the best subsets approach and a stepwise regression to select two other candidate models.
Best subsets
library(leaps)
best <- regsubsets(Y ~ (.)^2, data = db.train, really.big = TRUE, nbest = 1,
nvmax = 8)
info <- summary(best)
plot(2:9, info$cp, xlab = 'P (# of predictors + 1)', ylab = 'Cp')
abline(a = 0, b = 1, col = "blue")
res <- data.frame(P = 2:9,
                  adjR2 = info$adjr2,
                  BIC = info$bic,
                  `Cp-p` = abs(info$cp - 2:9))
res
  P  adjR2    BIC    Cp.p
1 2 0.4566 -205.2 58.2966
2 3 0.4771 -213.9 42.7252
3 4 0.5030 -227.0 23.1883
4 5 0.5190 -233.8 11.0928
5 6 0.5234 -232.2 7.7845
6 7 0.5351 -236.1 0.9397
7 8 0.5396 -234.7 4.2367
8 9 0.5451 -234.1 8.3252
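The sixth model (P = 7) has the lowest BIC and the \(C_p\) value closest to \(P\), so we extract its terms: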
columns <- info$which[6,]
columns[columns == TRUE]
(Intercept)         AGE         SEX     AGE:SEX      BMI:S1      BMI:S5
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
      BP:S2
       TRUE
mod6 <- lm(Y ~ AGE + SEX + AGE:SEX + BMI:S1 + BMI:S5 + BP:S2, data = db.train)
summary(mod6)
Call:
lm(formula = Y ~ AGE + SEX + AGE:SEX + BMI:S1 + BMI:S5 + BP:S2,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-176.68 -37.94 -0.63 33.96 134.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.46e+01 3.34e+01 1.34 0.18240
AGE -2.46e+00 6.80e-01 -3.62 0.00034 ***
SEX -1.06e+02 2.25e+01 -4.69 3.9e-06 ***
AGE:SEX 1.65e+00 4.42e-01 3.74 0.00022 ***
BMI:S1 -4.15e-02 6.24e-03 -6.65 1.2e-10 ***
BMI:S5 2.95e+00 2.07e-01 14.23 < 2e-16 ***
BP:S2 9.78e-03 1.58e-03 6.17 1.9e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.2 on 347 degrees of freedom
Multiple R-squared: 0.543, Adjusted R-squared: 0.535
F-statistic: 68.7 on 6 and 347 DF, p-value: <2e-16
Stepwise regression
Backward
initial_model <- lm(Y ~ (.)^2, data = db.train)
mod7 <- step(initial_model, trace = 0)
summary(mod7)
Both
initial_model <- lm(Y ~ ., data = db.train)
mod7 <- step(initial_model, scope = list(lower = Y ~ 1, upper = Y ~ (.)^2), trace = 0, direction = "both")
summary(mod7)
Call:
lm(formula = Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S5 + S6 +
AGE:SEX + BMI:BP + AGE:BP + S5:S6 + BP:S6 + AGE:S3 + SEX:BMI,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-156.73 -35.18 -0.24 32.72 147.72
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 704.8808 241.1628 2.92 0.00370 **
AGE -8.2707 2.2162 -3.73 0.00022 ***
SEX -153.9620 41.4571 -3.71 0.00024 ***
BMI -9.9224 4.9040 -2.02 0.04383 *
BP -0.8328 2.2236 -0.37 0.70824
S1 -2.1220 0.7234 -2.93 0.00358 **
S2 1.9878 0.7103 2.80 0.00543 **
S3 -0.5371 1.3404 -0.40 0.68888
S5 -51.8419 50.6249 -1.02 0.30655
S6 -3.1998 2.5259 -1.27 0.20610
AGE:SEX 1.5870 0.4783 3.32 0.00101 **
BMI:BP 0.1374 0.0514 2.67 0.00784 **
AGE:BP 0.0477 0.0181 2.63 0.00885 **
S5:S6 1.6427 0.6081 2.70 0.00725 **
BP:S6 -0.0447 0.0231 -1.93 0.05391 .
AGE:S3 0.0313 0.0207 1.51 0.13211
SEX:BMI 1.9754 1.4100 1.40 0.16213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52.3 on 337 degrees of freedom
Multiple R-squared: 0.571, Adjusted R-squared: 0.551
F-statistic: 28 on 16 and 337 DF, p-value: <2e-16
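Before moving to the test set, the three candidates can also be compared in-sample with AIC (a sketch; smaller values are preferred):
# In-sample AIC comparison of the candidate models
AIC(mod4, mod6, mod7)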
1.4 Model comparison
Use the three previously selected models to predict the response variable in the test set and compare the models using the \(PRESS_p\) measure and the root mean square deviation between the observed and the predicted values of the response variable.
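Recall that \(PRESS_p = \sum_{i=1}^{n} \left( \frac{e_i}{1-h_{ii}} \right)^2\), where \(e_i\) is the \(i\)-th residual and \(h_{ii}\) the corresponding diagonal element of the hat matrix, so the leave-one-out residuals can be obtained without refitting the model \(n\) times.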
models <- list(mod4, mod6, mod7)
measures <- function(model) {
summ <- summary(model)
pred <- predict(model, db.test)
dev <- pred - db.test$Y
data.frame(df = summ$df[1],
adjR2 = summ$adj.r.squared,
press = sum((residuals(model) / (1 - hatvalues(model)))^2), # hatvalues() already returns the hat-matrix diagonal
rmspe = sqrt(mean(dev^2)))
}
results <- do.call(rbind, lapply(models, measures))
results
  df  adjR2     press rmspe
1 9 0.5205 356054622 52.63
2 7 0.5351 347187035 53.39
3 17 0.5507 325947316 55.54
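Note how the criteria disagree: mod7 has the best in-sample measures (largest adjR2, smallest press) but the worst test-set rmspe, while the manually selected mod4 predicts best on the held-out data.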