Linear Models

Lab #1 – Simple linear regression model

Author

Paulo Soares

Published

October 8, 2025

1 Diabetes dataset

Ten baseline variables, namely age in years (AGE), sex (SEX), body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6), were obtained for each of 442 diabetes patients, together with the response of interest, a quantitative measure of disease progression one year after baseline (Y) (Efron et al., 2004). The data are available in the file diabetes.txt (18.1 KB) and a brief description of the variables is here.

1.1 Data reading and exploratory analysis

  1. Read the data, inspect it, and repair anything that may need fixing or that can be improved.

    url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
    db <- read.delim(url)  # the file is tab-delimited
    str(db)
    'data.frame':   442 obs. of  11 variables:
     $ AGE: int  59 48 72 24 50 23 36 66 60 29 ...
     $ SEX: int  2 1 2 1 1 1 2 2 2 1 ...
     $ BMI: num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
     $ BP : num  101 87 93 84 101 89 90 114 83 85 ...
     $ S1 : int  157 183 156 198 192 139 160 255 179 180 ...
     $ S2 : num  93.2 103.2 93.6 131.4 125.4 ...
     $ S3 : num  38 70 41 40 52 61 50 56 42 43 ...
     $ S4 : num  4 3 4 5 4 2 3 4.55 4 4 ...
     $ S5 : num  4.86 3.89 4.67 4.89 4.29 ...
     $ S6 : int  87 69 85 89 80 68 82 92 94 88 ...
     $ Y  : int  151 75 141 206 135 97 138 63 110 310 ...
    summary(db)
          AGE            SEX            BMI             BP              S1     
     Min.   :19.0   Min.   :1.00   Min.   :18.0   Min.   : 62.0   Min.   : 97  
     1st Qu.:38.2   1st Qu.:1.00   1st Qu.:23.2   1st Qu.: 84.0   1st Qu.:164  
     Median :50.0   Median :1.00   Median :25.7   Median : 93.0   Median :186  
     Mean   :48.5   Mean   :1.47   Mean   :26.4   Mean   : 94.6   Mean   :189  
     3rd Qu.:59.0   3rd Qu.:2.00   3rd Qu.:29.3   3rd Qu.:105.0   3rd Qu.:210  
     Max.   :79.0   Max.   :2.00   Max.   :42.2   Max.   :133.0   Max.   :301  
           S2              S3             S4             S5             S6       
     Min.   : 41.6   Min.   :22.0   Min.   :2.00   Min.   :3.26   Min.   : 58.0  
     1st Qu.: 96.0   1st Qu.:40.2   1st Qu.:3.00   1st Qu.:4.28   1st Qu.: 83.2  
     Median :113.0   Median :48.0   Median :4.00   Median :4.62   Median : 91.0  
     Mean   :115.4   Mean   :49.8   Mean   :4.07   Mean   :4.64   Mean   : 91.3  
     3rd Qu.:134.5   3rd Qu.:57.8   3rd Qu.:5.00   3rd Qu.:5.00   3rd Qu.: 98.0  
     Max.   :242.4   Max.   :99.0   Max.   :9.09   Max.   :6.11   Max.   :124.0  
           Y      
     Min.   : 25  
     1st Qu.: 87  
     Median :140  
     Mean   :152  
     3rd Qu.:212  
     Max.   :346  
    db$SEX <- as.factor(db$SEX)  # SEX is categorical (coded 1 and 2), so store it as a factor
    str(db)
    'data.frame':   442 obs. of  11 variables:
     $ AGE: int  59 48 72 24 50 23 36 66 60 29 ...
     $ SEX: Factor w/ 2 levels "1","2": 2 1 2 1 1 1 2 2 2 1 ...
     $ BMI: num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
     $ BP : num  101 87 93 84 101 89 90 114 83 85 ...
     $ S1 : int  157 183 156 198 192 139 160 255 179 180 ...
     $ S2 : num  93.2 103.2 93.6 131.4 125.4 ...
     $ S3 : num  38 70 41 40 52 61 50 56 42 43 ...
     $ S4 : num  4 3 4 5 4 2 3 4.55 4 4 ...
     $ S5 : num  4.86 3.89 4.67 4.89 4.29 ...
     $ S6 : int  87 69 85 89 80 68 82 92 94 88 ...
     $ Y  : int  151 75 141 206 135 97 138 63 110 310 ...
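
    As an extra check (a small addition to the solution above), one may confirm that there are no missing values, which the absence of NA's counts in the summary already suggests, and look for duplicated rows:

    colSums(is.na(db))   # number of missing values per variable
    sum(duplicated(db))  # number of duplicated rows
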
  2. For now we can only consider a single covariate. Try to convince yourself that BMI is a good starting candidate by producing, for instance, all the scatter plots between the response variable and each covariate, and by analyzing the linear correlations between all variables with both numerical measures and graphical representations.

    library(ggplot2)
    library(dplyr)
    
    db |>
      tidyr::pivot_longer(-c(Y, SEX), names_to = "Covariate",
                                      values_to = "Value") |>
      ggplot() +
        geom_point(aes(Value, Y, color = SEX), size = 1/3) + 
        facet_wrap(~ Covariate, scales = "free_x") +
        theme(legend.position = "top")

    ggplot(db) +
      geom_boxplot(aes(Y, fill = SEX)) +
      theme(legend.position = "top")

    db |> 
      select(-SEX) |>
      cor()
             AGE     BMI      BP      S1      S2       S3      S4      S5      S6
    AGE  1.00000  0.1851  0.3354 0.26006  0.2192 -0.07518  0.2038  0.2708  0.3017
    BMI  0.18508  1.0000  0.3954 0.24978  0.2612 -0.36681  0.4138  0.4462  0.3887
    BP   0.33543  0.3954  1.0000 0.24246  0.1855 -0.17876  0.2577  0.3935  0.3904
    S1   0.26006  0.2498  0.2425 1.00000  0.8967  0.05152  0.5422  0.5155  0.3257
    S2   0.21924  0.2612  0.1855 0.89666  1.0000 -0.19646  0.6598  0.3184  0.2906
    S3  -0.07518 -0.3668 -0.1788 0.05152 -0.1965  1.00000 -0.7385 -0.3986 -0.2737
    S4   0.20384  0.4138  0.2577 0.54221  0.6598 -0.73849  1.0000  0.6179  0.4172
    S5   0.27077  0.4462  0.3935 0.51550  0.3184 -0.39858  0.6179  1.0000  0.4647
    S6   0.30173  0.3887  0.3904 0.32572  0.2906 -0.27370  0.4172  0.4647  1.0000
    Y    0.18789  0.5865  0.4415 0.21202  0.1741 -0.39479  0.4305  0.5659  0.3825
              Y
    AGE  0.1879
    BMI  0.5865
    BP   0.4415
    S1   0.2120
    S2   0.1741
    S3  -0.3948
    S4   0.4305
    S5   0.5659
    S6   0.3825
    Y    1.0000
    GGally::ggcorr(db)
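
    As a quick numerical complement (an addition to the plots and matrix above), the covariates can be ranked by their absolute linear correlation with Y; BMI comes first (0.59), followed by S5 (0.57):

    r <- db |>
      select(-SEX) |>
      cor()
    sort(abs(r[colnames(r) != "Y", "Y"]), decreasing = TRUE)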

1.2 Linear regression in R

  1. The lm function is used to fit linear models, including multivariate ones. Use it to fit an SLRM with the covariate referred to before and explore the returned object.

    model <- lm(Y ~ BMI, data = db)
    model
    
    Call:
    lm(formula = Y ~ BMI, data = db)
    
    Coefficients:
    (Intercept)          BMI  
         -117.8         10.2  
    ms <- summary(model)
    ms
    
    Call:
    lm(formula = Y ~ BMI, data = db)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -164.92  -43.57   -8.65   46.34  154.88 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -117.773     18.019   -6.54  1.8e-10 ***
    BMI           10.233      0.674   15.19  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 62.5 on 440 degrees of freedom
    Multiple R-squared:  0.344, Adjusted R-squared:  0.342 
    F-statistic:  231 on 1 and 440 DF,  p-value: <2e-16
    ms$coefficients[2, 1]
    [1] 10.23
    ms$sigma
    [1] 62.52
    ms$r.squared
    [1] 0.3439
    par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
    plot(model)           # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

    g <- ggplot(db, aes(BMI, Y)) +
           geom_point()
    
    g + geom_smooth(method = lm, se = FALSE)
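
    As a sanity check (an addition to the solution above), the least-squares estimates can be reproduced from the closed-form expressions for simple linear regression, with slope Sxy/Sxx and intercept equal to the mean of Y minus the slope times the mean of BMI:

    b1 <- with(db, cov(Y, BMI) / var(BMI))    # slope: Sxy / Sxx
    b0 <- with(db, mean(Y) - b1 * mean(BMI))  # intercept: mean(Y) - b1 * mean(BMI)
    c(intercept = b0, slope = b1)
    coef(model)                               # should agree: about -117.8 and 10.2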

  2. Compute 96% confidence intervals for the regression coefficients and for the mean value of the response variable for people with BMI equal to 25 and 40.

    confint(model, level = 0.96)
                     2 %   98 %
    (Intercept) -154.890 -80.66
    BMI            8.845  11.62
    new_data <- data.frame(BMI = c(25, 40))
    predict(model, new_data, level = 0.96, interval = "confidence")
        fit   lwr   upr
    1 138.1 131.6 144.5
    2 291.6 271.7 311.4
    g + geom_smooth(method = lm, level = 0.96)

    predict(model, new_data, level = 0.96, interval = "prediction")
        fit     lwr   upr
    1 138.1   9.123 267.0
    2 291.6 161.255 421.8
    gist <- "https://gist.github.com/andrewbaxter439/b508a60786f8af3c0be7b381a667ae07"
    devtools::source_gist(gist, quiet = TRUE)
    g +
      stat_predict(method = "lm", geom = "ribbon", fill = "hotpink",
                   colour = NA, level = 0.96, alpha = 0.075) +
      geom_smooth(method = lm, level = 0.96)  
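
    The confidence limits returned by predict can also be reproduced by hand (a sketch added as a check on the output above), using fit plus or minus qt(0.98, n - 2) times the standard error of the fitted mean, s * sqrt(1/n + (x0 - xbar)^2 / Sxx):

    x0 <- 25
    n <- nrow(db)
    xbar <- mean(db$BMI)
    Sxx <- sum((db$BMI - xbar)^2)
    se_fit <- ms$sigma * sqrt(1 / n + (x0 - xbar)^2 / Sxx)
    yhat <- unname(coef(model)[1] + coef(model)[2] * x0)
    yhat + c(-1, 1) * qt(0.98, df = n - 2) * se_fit  # matches the first row above: about 131.6 and 144.5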

    Notes

    1. The last graphic uses external code that extends ggplot2 capabilities with the stat_predict function.

    2. Both bands are built from pointwise intervals, so they can be misleading regarding the global (simultaneous) confidence level; corrections are required for simultaneous coverage.
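
    As an illustration of the second note (an addition, not part of the lab), one standard correction that gives simultaneous coverage for the mean response is the Working–Hotelling band, which replaces the t quantile with sqrt(2 * qf(1 - alpha, 2, n - 2)); for only two BMI values, a Bonferroni adjustment (level 1 - alpha/2 per interval) is a common alternative. A minimal sketch:

    alpha <- 0.04
    n <- nrow(db)
    W <- sqrt(2 * qf(1 - alpha, df1 = 2, df2 = n - 2))  # Working-Hotelling multiplier
    p <- predict(model, new_data, se.fit = TRUE)
    cbind(fit = p$fit,
          lwr = p$fit - W * p$se.fit,
          upr = p$fit + W * p$se.fit)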


References

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression (with discussion). The Annals of Statistics, 32(2), 407–499.