Linear Models
Lab#4 – Diagnostics and model selection
1 The diabetes dataset that just won’t go away . . .
We will keep using the dataset introduced in Lab#1 where we had this brief description:
Ten baseline variables, AGE (in years), SEX, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6) were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (Y). The data is available in the file diabetes.txt.
url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
db <- read.delim(url)
1.1 Train & test
Start by splitting the data into a training set and a testing set. The latter should contain 20% of the observations, randomly sampled in a reproducible way. The training set will be used to conduct model selection and diagnostics and the test set will be used to compare the predictive power of a few candidate models.
set.seed(3185) # for reproducible example
n <- nrow(db)
test.index <- sample(n, 0.2 * n)
db.test <- db[test.index,]
db.train <- db[-test.index,]
1.2 Diagnostics and manual variable selection
Fit the first-order model with all covariates and use the results from the t-tests to select a reduced model. Consider the inclusion of first-order interactions and apply the more common diagnostic techniques to the last selected model.
mod1 <- lm(Y ~ ., data = db.train)
summary(mod1)
Call:
lm(formula = Y ~ ., data = db.train)
Residuals:
Min 1Q Median 3Q Max
-161.43 -37.86 -0.64 36.88 149.37
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -366.8566 77.1556 -4.75 2.9e-06 ***
AGE -0.0626 0.2449 -0.26 0.79855
SEX -25.1110 6.5398 -3.84 0.00015 ***
BMI 5.7989 0.8052 7.20 3.8e-12 ***
BP 1.1058 0.2570 4.30 2.2e-05 ***
S1 -1.4509 0.6839 -2.12 0.03458 *
S2 1.0761 0.6343 1.70 0.09070 .
S3 0.7231 0.8860 0.82 0.41498
S4 7.4441 6.5200 1.14 0.25436
S5 79.4815 18.6547 4.26 2.6e-05 ***
S6 0.1823 0.3061 0.60 0.55192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.7 on 343 degrees of freedom
Multiple R-squared: 0.521, Adjusted R-squared: 0.507
F-statistic: 37.3 on 10 and 343 DF, p-value: <2e-16
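Removing the covariates with the largest p-values (AGE, S3, S4, S6) yields a reduced model: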
mod2 <- lm(Y ~ SEX + BMI + BP + S1 + S2 + S5, data = db.train)
summary(mod2)
Call:
lm(formula = Y ~ SEX + BMI + BP + S1 + S2 + S5, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-162.98 -38.44 -1.04 36.40 147.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -319.246 28.693 -11.13 < 2e-16 ***
SEX -24.485 6.439 -3.80 0.00017 ***
BMI 5.848 0.792 7.39 1.1e-12 ***
BP 1.089 0.247 4.41 1.4e-05 ***
S1 -1.146 0.249 -4.60 6.0e-06 ***
S2 0.936 0.255 3.67 0.00028 ***
S5 77.393 8.482 9.12 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.6 on 347 degrees of freedom
Multiple R-squared: 0.519, Adjusted R-squared: 0.51
F-statistic: 62.4 on 6 and 347 DF, p-value: <2e-16
mod3 <- lm(Y ~ (SEX + BMI + BP + S1 + S2 + S5)^2, data = db.train)
summary(mod3)
Call:
lm(formula = Y ~ (SEX + BMI + BP + S1 + S2 + S5)^2, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-165.61 -35.84 -1.95 30.19 169.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.32e+02 2.54e+02 1.30 0.193
SEX -1.25e+02 6.38e+01 -1.96 0.051 .
BMI -5.92e+00 7.93e+00 -0.75 0.456
BP -4.51e+00 2.69e+00 -1.68 0.095 .
S1 -1.05e+00 2.20e+00 -0.48 0.634
S2 3.48e-01 2.28e+00 0.15 0.879
S5 1.83e+01 8.34e+01 0.22 0.827
SEX:BMI 1.32e+00 1.80e+00 0.73 0.465
SEX:BP 7.28e-01 5.40e-01 1.35 0.179
SEX:S1 7.52e-01 5.34e-01 1.41 0.160
SEX:S2 -8.91e-01 5.43e-01 -1.64 0.102
SEX:S5 -9.37e+00 1.79e+01 -0.52 0.602
BMI:BP 1.21e-01 5.44e-02 2.22 0.027 *
BMI:S1 -1.54e-01 6.77e-02 -2.28 0.023 *
BMI:S2 1.18e-01 6.82e-02 1.73 0.085 .
BMI:S5 2.87e+00 2.08e+00 1.38 0.167
BP:S1 2.99e-02 1.92e-02 1.56 0.120
BP:S2 -2.09e-02 1.99e-02 -1.05 0.296
BP:S5 -4.07e-01 6.88e-01 -0.59 0.554
S1:S2 -2.45e-03 2.45e-03 -1.00 0.317
S1:S5 2.00e-02 3.84e-01 0.05 0.958
S2:S5 3.25e-01 4.47e-01 0.73 0.467
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.7 on 332 degrees of freedom
Multiple R-squared: 0.553, Adjusted R-squared: 0.525
F-statistic: 19.6 on 21 and 332 DF, p-value: <2e-16
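Only the BMI:BP and BMI:S1 interactions are significant at the 5% level, so we add just those two to the reduced model: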
mod4 <- lm(Y ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data = db.train)
summary(mod4)
Call:
lm(formula = Y ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-157.7 -36.6 -2.0 34.0 153.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.8635 140.3309 -0.19 0.8483
SEX -25.4142 6.3804 -3.98 8.3e-05 ***
BMI -5.3278 5.3157 -1.00 0.3169
BP -2.8081 1.3057 -2.15 0.0322 *
S1 -0.8468 0.6058 -1.40 0.1631
S2 1.0425 0.2547 4.09 5.3e-05 ***
S5 79.4039 8.4331 9.42 < 2e-16 ***
BMI:BP 0.1438 0.0473 3.04 0.0025 **
BMI:S1 -0.0136 0.0212 -0.64 0.5232
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54 on 345 degrees of freedom
Multiple R-squared: 0.531, Adjusted R-squared: 0.52
F-statistic: 48.9 on 8 and 345 DF, p-value: <2e-16
Homoscedasticity
Using the Breusch-Pagan test from the lmtest package:
lmtest::bptest(mod4)
studentized Breusch-Pagan test
data: mod4
BP = 38, df = 8, p-value = 7e-06
The homoscedasticity null hypothesis is rejected, so the residual variance may depend on the covariates. However, the plots below do not reveal any clear association:
res <- residuals(mod4) # residuals of the current model
par(mfrow = c(3, 3))
vars <- names(db.train)
vars <- vars[!(vars %in% c("SEX", "Y"))]
for (v in vars) plot(db.train[, v], res, xlab = v)
par(mfrow = c(1, 1))
Influential observations
summary(cooks.distance(mod4))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000000 0.000206 0.000950 0.002793 0.003069 0.048550
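All distances are far below 1, the usual alarm level. To list the observations above the common 4/n rule of thumb, a minimal sketch (the cutoff choice is a convention, not part of the lab statement):
# Cook's distances above the 4/n rule of thumb
cd <- cooks.distance(mod4)
which(cd > 4 / nrow(db.train))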
plot(mod4, which = 4)
The following section is just an illustration of some of the many additional procedures that can be used as diagnostics or corrective measures for a linear model.
Linearity
lmtest::harvtest(mod4)
Harvey-Collier test
data: mod4
HC = 0.56, df = 344, p-value = 0.6
lmtest::raintest(mod4)
Rainbow test
data: mod4
Rain = 0.92, df1 = 177, df2 = 168, p-value = 0.7
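Neither test rejects linearity. A residuals-versus-fitted plot is the usual graphical complement (a sketch; any systematic curvature would suggest a missing non-linear term):
# Residuals vs fitted values for mod4
plot(mod4, which = 1)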
Correlation
lmtest::dwtest(mod4)
Durbin-Watson test
data: mod4
DW = 2.1, p-value = 0.9
alternative hypothesis: true autocorrelation is greater than 0
lmtest::bgtest(mod4)
Breusch-Godfrey test for serial correlation of order up to 1
data: mod4
LM test = 1.3, df = 1, p-value = 0.3
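Neither test suggests serially correlated errors. The sample autocorrelation function of the residuals gives a quick visual confirmation (sketch):
# ACF of the residuals; spikes outside the dashed bands would indicate correlation
acf(residuals(mod4))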
Normality
shapiro.test(rstudent(mod4))
Shapiro-Wilk normality test
data: rstudent(mod4)
W = 1, p-value = 0.3
ppcc::ppccTest(rstudent(mod4), qfn = "qnorm") # Filliben test
Probability Plot Correlation Coefficient Test
data: rstudent(mod4)
ppcc = 1, n = 354, p-value = 0.3
alternative hypothesis: rstudent(mod4) differs from a Normal distribution
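Both tests are compatible with normal errors. A normal QQ plot is the standard graphical check; a sketch using car's qqPlot, which plots the studentized residuals with a pointwise confidence envelope:
# QQ plot of studentized residuals with a confidence envelope
car::qqPlot(mod4)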
Box-Cox transformation
# Yt = (Y^lambda - 1) / lambda or log(Y) for lambda==0
bc <- car::boxCox(mod4)
lambda <- bc$x[which.max(bc$y)]
lambda
[1] 0.4242
mod5 <- lm(car::bcnPower(Y, lambda, gamma = 0) ~ SEX + BMI + BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data=db.train)
summary(mod5)
Call:
lm(formula = car::bcnPower(Y, lambda, gamma = 0) ~ SEX + BMI +
BP + S1 + S2 + S5 + BMI:BP + BMI:S1, data = db.train)
Residuals:
Min 1Q Median 3Q Max
-9.926 -2.003 0.061 2.072 7.581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.631794 7.975072 0.46 0.649
SEX -1.513567 0.362600 -4.17 3.8e-05 ***
BMI -0.211274 0.302092 -0.70 0.485
BP -0.119520 0.074206 -1.61 0.108
S1 -0.056203 0.034428 -1.63 0.103
S2 0.065851 0.014477 4.55 7.5e-06 ***
S5 4.719141 0.479258 9.85 < 2e-16 ***
BMI:BP 0.006654 0.002689 2.47 0.014 *
BMI:S1 -0.000657 0.001207 -0.54 0.587
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.07 on 345 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.514
F-statistic: 47.6 on 8 and 345 DF, p-value: <2e-16
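Note that mod5 produces predictions on the transformed scale; they can be mapped back by inverting the Box-Cox formula above (a sketch, valid here since lambda is non-zero and Y is positive):
# Back-transform predictions to the original scale of Y
pred.t <- predict(mod5, db.test)
pred.orig <- (lambda * pred.t + 1)^(1 / lambda)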
Homoscedasticity
lmtest::bptest(mod5)
studentized Breusch-Pagan test
data: mod5
BP = 23, df = 8, p-value = 0.003
Note: the Box-Cox transformation does not solve the heteroscedasticity problem!
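A common corrective measure in this situation is to keep the least-squares estimates but base inference on heteroscedasticity-consistent standard errors. A sketch, assuming the sandwich package is installed:
# t-tests for mod4 with HC3 robust standard errors
lmtest::coeftest(mod4, vcov = sandwich::vcovHC(mod4, type = "HC3"))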
1.3 Automatic model search
Use the best subsets approach and a stepwise regression to select two other candidate models.
Best subsets
library(leaps)
best <- regsubsets(Y ~ (.)^2, data = db.train, really.big = TRUE, nbest = 1,
nvmax = 8)
info <- summary(best)
plot(2:9, info$cp, xlab = 'P (# of predictors + 1)', ylab = 'Cp')
abline(a = 0, b = 1, col = "blue")
res <- data.frame(P = 2:9,
                  adjR2 = info$adjr2,
                  BIC = info$bic,
                  `Cp-p` = abs(info$cp - 2:9))
res
  P  adjR2    BIC    Cp.p
1 2 0.4566 -205.2 58.2966
2 3 0.4771 -213.9 42.7252
3 4 0.5030 -227.0 23.1883
4 5 0.5190 -233.8 11.0928
5 6 0.5234 -232.2 7.7845
6 7 0.5351 -236.1 0.9397
7 8 0.5396 -234.7 4.2367
8 9 0.5451 -234.1 8.3252
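The sixth model (P = 7) has the lowest BIC and the \(C_p\) value closest to \(P\), so we extract its terms: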
columns <- info$which[6,]
columns[columns == TRUE]
(Intercept)         AGE         SEX     AGE:SEX      BMI:S1      BMI:S5
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
      BP:S2
       TRUE
mod6 <- lm(Y ~ AGE + SEX + AGE:SEX + BMI:S1 + BMI:S5 + BP:S2, data = db.train)
summary(mod6)
Call:
lm(formula = Y ~ AGE + SEX + AGE:SEX + BMI:S1 + BMI:S5 + BP:S2,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-176.68 -37.94 -0.63 33.96 134.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.46e+01 3.34e+01 1.34 0.18240
AGE -2.46e+00 6.80e-01 -3.62 0.00034 ***
SEX -1.06e+02 2.25e+01 -4.69 3.9e-06 ***
AGE:SEX 1.65e+00 4.42e-01 3.74 0.00022 ***
BMI:S1 -4.15e-02 6.24e-03 -6.65 1.2e-10 ***
BMI:S5 2.95e+00 2.07e-01 14.23 < 2e-16 ***
BP:S2 9.78e-03 1.58e-03 6.17 1.9e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.2 on 347 degrees of freedom
Multiple R-squared: 0.543, Adjusted R-squared: 0.535
F-statistic: 68.7 on 6 and 347 DF, p-value: <2e-16
Stepwise regression
Backward
initial_model <- lm(Y ~ (.)^2, data = db.train)
mod7 <- step(initial_model, trace = 0)
summary(mod7)
Both
initial_model <- lm(Y ~ ., data = db.train)
mod7 <- step(initial_model, scope = list(lower = Y ~ 1, upper = Y ~ (.)^2), trace = 0, direction = "both")
summary(mod7)
Call:
lm(formula = Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S5 + S6 +
AGE:SEX + BMI:BP + AGE:BP + S5:S6 + BP:S6 + AGE:S3 + SEX:BMI,
data = db.train)
Residuals:
Min 1Q Median 3Q Max
-156.73 -35.18 -0.24 32.72 147.72
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 704.8808 241.1628 2.92 0.00370 **
AGE -8.2707 2.2162 -3.73 0.00022 ***
SEX -153.9620 41.4571 -3.71 0.00024 ***
BMI -9.9224 4.9040 -2.02 0.04383 *
BP -0.8328 2.2236 -0.37 0.70824
S1 -2.1220 0.7234 -2.93 0.00358 **
S2 1.9878 0.7103 2.80 0.00543 **
S3 -0.5371 1.3404 -0.40 0.68888
S5 -51.8419 50.6249 -1.02 0.30655
S6 -3.1998 2.5259 -1.27 0.20610
AGE:SEX 1.5870 0.4783 3.32 0.00101 **
BMI:BP 0.1374 0.0514 2.67 0.00784 **
AGE:BP 0.0477 0.0181 2.63 0.00885 **
S5:S6 1.6427 0.6081 2.70 0.00725 **
BP:S6 -0.0447 0.0231 -1.93 0.05391 .
AGE:S3 0.0313 0.0207 1.51 0.13211
SEX:BMI 1.9754 1.4100 1.40 0.16213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52.3 on 337 degrees of freedom
Multiple R-squared: 0.571, Adjusted R-squared: 0.551
F-statistic: 28 on 16 and 337 DF, p-value: <2e-16
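Before moving to the test set, the three candidates can also be compared in-sample with AIC (a sketch; smaller values are preferred):
# In-sample AIC comparison of the candidate models
AIC(mod4, mod6, mod7)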
1.4 Model comparison
Use the three previously selected models to predict the response variable in the test set and compare the models using the \(PRESS_p\) measure and the root mean square deviation between the observed and the predicted values of the response variable.
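Recall that \(PRESS_p = \sum_{i=1}^{n} \left( \frac{e_i}{1-h_{ii}} \right)^2\), where \(e_i\) is the \(i\)-th residual and \(h_{ii}\) the corresponding diagonal element of the hat matrix, so the leave-one-out residuals can be obtained without refitting the model \(n\) times.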
models <- list(mod4, mod6, mod7)
measures <- function(model) {
summ <- summary(model)
pred <- predict(model, db.test)
dev <- pred - db.test$Y
data.frame(df = summ$df[1],
adjR2 = summ$adj.r.squared,
press = sum((residuals(model) / (1 - hatvalues(model)))^2), # hatvalues() already returns the hat-matrix diagonal
rmspe = sqrt(mean(dev^2)))
}
results <- do.call(rbind, lapply(models, measures))
results
  df  adjR2     press rmspe
1 9 0.5205 356054622 52.63
2 7 0.5351 347187035 53.39
3 17 0.5507 325947316 55.54
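Note how the criteria disagree: mod7 has the best in-sample measures (largest adjR2, smallest press) but the worst test-set rmspe, while the manually selected mod4 predicts best on the held-out data.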