Linear Models

Lab #1 – Simple linear regression model

Author

Paulo Soares

Published

October 8, 2025

1 Diabetes dataset

Ten baseline variables, namely age in years (AGE), sex (SEX), body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6), were obtained for each of 442 diabetes patients, together with the response of interest, a quantitative measure of disease progression one year after baseline (Y) (Efron et al., 2004). The data are available in the file diabetes.txt (18.1 KB) and a brief description of the variables is here.

1.1 Data reading and exploratory analysis

  1. Read the data, inspect it, and repair anything that may need fixing or that can be improved.

    url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
    db <- read.delim(url)  # the file is tab-delimited
    str(db)
    'data.frame':   442 obs. of  11 variables:
     $ AGE: int  59 48 72 24 50 23 36 66 60 29 ...
     $ SEX: int  2 1 2 1 1 1 2 2 2 1 ...
     $ BMI: num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
     $ BP : num  101 87 93 84 101 89 90 114 83 85 ...
     $ S1 : int  157 183 156 198 192 139 160 255 179 180 ...
     $ S2 : num  93.2 103.2 93.6 131.4 125.4 ...
     $ S3 : num  38 70 41 40 52 61 50 56 42 43 ...
     $ S4 : num  4 3 4 5 4 2 3 4.55 4 4 ...
     $ S5 : num  4.86 3.89 4.67 4.89 4.29 ...
     $ S6 : int  87 69 85 89 80 68 82 92 94 88 ...
     $ Y  : int  151 75 141 206 135 97 138 63 110 310 ...
    summary(db)
          AGE            SEX            BMI             BP              S1     
     Min.   :19.0   Min.   :1.00   Min.   :18.0   Min.   : 62.0   Min.   : 97  
     1st Qu.:38.2   1st Qu.:1.00   1st Qu.:23.2   1st Qu.: 84.0   1st Qu.:164  
     Median :50.0   Median :1.00   Median :25.7   Median : 93.0   Median :186  
     Mean   :48.5   Mean   :1.47   Mean   :26.4   Mean   : 94.6   Mean   :189  
     3rd Qu.:59.0   3rd Qu.:2.00   3rd Qu.:29.3   3rd Qu.:105.0   3rd Qu.:210  
     Max.   :79.0   Max.   :2.00   Max.   :42.2   Max.   :133.0   Max.   :301  
           S2              S3             S4             S5             S6       
     Min.   : 41.6   Min.   :22.0   Min.   :2.00   Min.   :3.26   Min.   : 58.0  
     1st Qu.: 96.0   1st Qu.:40.2   1st Qu.:3.00   1st Qu.:4.28   1st Qu.: 83.2  
     Median :113.0   Median :48.0   Median :4.00   Median :4.62   Median : 91.0  
     Mean   :115.4   Mean   :49.8   Mean   :4.07   Mean   :4.64   Mean   : 91.3  
     3rd Qu.:134.5   3rd Qu.:57.8   3rd Qu.:5.00   3rd Qu.:5.00   3rd Qu.: 98.0  
     Max.   :242.4   Max.   :99.0   Max.   :9.09   Max.   :6.11   Max.   :124.0  
           Y      
     Min.   : 25  
     1st Qu.: 87  
     Median :140  
     Mean   :152  
     3rd Qu.:212  
     Max.   :346  
    db$SEX <- as.factor(db$SEX)  # SEX is categorical (coded 1 and 2), so store it as a factor
    str(db)
    'data.frame':   442 obs. of  11 variables:
     $ AGE: int  59 48 72 24 50 23 36 66 60 29 ...
     $ SEX: Factor w/ 2 levels "1","2": 2 1 2 1 1 1 2 2 2 1 ...
     $ BMI: num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
     $ BP : num  101 87 93 84 101 89 90 114 83 85 ...
     $ S1 : int  157 183 156 198 192 139 160 255 179 180 ...
     $ S2 : num  93.2 103.2 93.6 131.4 125.4 ...
     $ S3 : num  38 70 41 40 52 61 50 56 42 43 ...
     $ S4 : num  4 3 4 5 4 2 3 4.55 4 4 ...
     $ S5 : num  4.86 3.89 4.67 4.89 4.29 ...
     $ S6 : int  87 69 85 89 80 68 82 92 94 88 ...
     $ Y  : int  151 75 141 206 135 97 138 63 110 310 ...
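
    As an extra check (a small addition to the solution above), one may confirm that there are no missing values, which the absence of NA's counts in the summary already suggests, and look for duplicated rows:

    colSums(is.na(db))   # number of missing values per variable
    sum(duplicated(db))  # number of duplicated rows
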
  2. For now we can only consider a single covariate. Try to convince yourself that BMI is a good starting candidate by producing, for instance, all the scatter plots between the response variable and each covariate, and by analyzing the linear correlations between all variables with both numerical measures and graphical representations.

    library(ggplot2)
    library(dplyr)
    
    db |>
      tidyr::pivot_longer(-c(Y, SEX), names_to = "Covariate",
                                      values_to = "Value") |>
      ggplot() +
        geom_point(aes(Value, Y, color = SEX), size = 1/3) + 
        facet_wrap(~ Covariate, scales = "free_x") +
        theme(legend.position = "top")

    ggplot(db) +
      geom_boxplot(aes(Y, fill = SEX)) +
      theme(legend.position = "top")

    db |> 
      select(-SEX) |>
      cor()
             AGE     BMI      BP      S1      S2       S3      S4      S5      S6
    AGE  1.00000  0.1851  0.3354 0.26006  0.2192 -0.07518  0.2038  0.2708  0.3017
    BMI  0.18508  1.0000  0.3954 0.24978  0.2612 -0.36681  0.4138  0.4462  0.3887
    BP   0.33543  0.3954  1.0000 0.24246  0.1855 -0.17876  0.2577  0.3935  0.3904
    S1   0.26006  0.2498  0.2425 1.00000  0.8967  0.05152  0.5422  0.5155  0.3257
    S2   0.21924  0.2612  0.1855 0.89666  1.0000 -0.19646  0.6598  0.3184  0.2906
    S3  -0.07518 -0.3668 -0.1788 0.05152 -0.1965  1.00000 -0.7385 -0.3986 -0.2737
    S4   0.20384  0.4138  0.2577 0.54221  0.6598 -0.73849  1.0000  0.6179  0.4172
    S5   0.27077  0.4462  0.3935 0.51550  0.3184 -0.39858  0.6179  1.0000  0.4647
    S6   0.30173  0.3887  0.3904 0.32572  0.2906 -0.27370  0.4172  0.4647  1.0000
    Y    0.18789  0.5865  0.4415 0.21202  0.1741 -0.39479  0.4305  0.5659  0.3825
              Y
    AGE  0.1879
    BMI  0.5865
    BP   0.4415
    S1   0.2120
    S2   0.1741
    S3  -0.3948
    S4   0.4305
    S5   0.5659
    S6   0.3825
    Y    1.0000
    GGally::ggcorr(db)
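
    As a quick numerical complement (an addition to the plots and matrix above), the covariates can be ranked by their absolute linear correlation with Y; BMI comes first (0.59), followed by S5 (0.57):

    r <- db |>
      select(-SEX) |>
      cor()
    sort(abs(r[colnames(r) != "Y", "Y"]), decreasing = TRUE)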

1.2 Linear regression in R

  1. The lm function is used to fit linear models, including multivariate ones. Use it to fit an SLRM with the covariate referred to before and explore the returned object.

    model <- lm(Y ~ BMI, data = db)
    model
    
    Call:
    lm(formula = Y ~ BMI, data = db)
    
    Coefficients:
    (Intercept)          BMI  
         -117.8         10.2  
    ms <- summary(model)
    ms
    
    Call:
    lm(formula = Y ~ BMI, data = db)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -164.92  -43.57   -8.65   46.34  154.88 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -117.773     18.019   -6.54  1.8e-10 ***
    BMI           10.233      0.674   15.19  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 62.5 on 440 degrees of freedom
    Multiple R-squared:  0.344, Adjusted R-squared:  0.342 
    F-statistic:  231 on 1 and 440 DF,  p-value: <2e-16
    ms$coefficients[2, 1]
    [1] 10.23
    ms$sigma
    [1] 62.52
    ms$r.squared
    [1] 0.3439
    par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
    plot(model)           # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

    g <- ggplot(db, aes(BMI, Y)) +
           geom_point()
    
    g + geom_smooth(method = lm, se = FALSE)
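
    As a sanity check (an addition to the solution above), the least-squares estimates can be reproduced from the closed-form expressions for simple linear regression, with slope Sxy/Sxx and intercept equal to the mean of Y minus the slope times the mean of BMI:

    b1 <- with(db, cov(Y, BMI) / var(BMI))    # slope: Sxy / Sxx
    b0 <- with(db, mean(Y) - b1 * mean(BMI))  # intercept: mean(Y) - b1 * mean(BMI)
    c(intercept = b0, slope = b1)
    coef(model)                               # should agree: about -117.8 and 10.2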

  2. Compute 96% confidence intervals for the regression coefficients and for the mean value of the response variable for people with BMI equal to 25 and 40.

    confint(model, level = 0.96)
                     2 %   98 %
    (Intercept) -154.890 -80.66
    BMI            8.845  11.62
    new_data <- data.frame(BMI = c(25, 40))
    predict(model, new_data, level = 0.96, interval = "confidence")
        fit   lwr   upr
    1 138.1 131.6 144.5
    2 291.6 271.7 311.4
    g + geom_smooth(method = lm, level = 0.96)

    predict(model, new_data, level = 0.96, interval = "prediction")
        fit     lwr   upr
    1 138.1   9.123 267.0
    2 291.6 161.255 421.8
    gist <- "https://gist.github.com/andrewbaxter439/b508a60786f8af3c0be7b381a667ae07"
    devtools::source_gist(gist, quiet = TRUE)
    g +
      stat_predict(method = "lm", geom = "ribbon", fill = "hotpink",
                   colour = NA, level = 0.96, alpha = 0.075) +
      geom_smooth(method = lm, level = 0.96)  
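
    The confidence limits returned by predict can also be reproduced by hand (a sketch added as a check on the output above), using fit plus or minus qt(0.98, n - 2) times the standard error of the fitted mean, s * sqrt(1/n + (x0 - xbar)^2 / Sxx):

    x0 <- 25
    n <- nrow(db)
    xbar <- mean(db$BMI)
    Sxx <- sum((db$BMI - xbar)^2)
    se_fit <- ms$sigma * sqrt(1 / n + (x0 - xbar)^2 / Sxx)
    yhat <- unname(coef(model)[1] + coef(model)[2] * x0)
    yhat + c(-1, 1) * qt(0.98, df = n - 2) * se_fit  # matches the first row above: about 131.6 and 144.5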

    Notes

    1. The last graphic uses external code that extends ggplot2 capabilities with the stat_predict function.

    2. Both bands are built from pointwise intervals, so they can be misleading regarding the global (simultaneous) confidence level; corrections are required for simultaneous coverage.
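
    As an illustration of the second note (an addition, not part of the lab), one standard correction that gives simultaneous coverage for the mean response is the Working–Hotelling band, which replaces the t quantile with sqrt(2 * qf(1 - alpha, 2, n - 2)); for only two BMI values, a Bonferroni adjustment (level 1 - alpha/2 per interval) is a common alternative. A minimal sketch:

    alpha <- 0.04
    n <- nrow(db)
    W <- sqrt(2 * qf(1 - alpha, df1 = 2, df2 = n - 2))  # Working-Hotelling multiplier
    p <- predict(model, new_data, se.fit = TRUE)
    cbind(fit = p$fit,
          lwr = p$fit - W * p$se.fit,
          upr = p$fit + W * p$se.fit)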


References

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression (with discussion). The Annals of Statistics, 32(2), 407–499.