Linear Models

Lab#2 – Multiple linear regression model 1

Author

Paulo Soares

Published

October 9, 2025

1 The diabetes dataset again . . .

We will keep using the dataset introduced in Lab#1 where we had this brief description:

Ten baseline variables, AGE (in years), SEX, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6) were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (Y). The data is available in the file diabetes.txt.

1.1 The specification of linear models

In R, models are specified in a compact symbolic form. From the documentation:

The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
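
As a quick check of this syntax, the term labels generated by a formula can be inspected directly with terms(). A minimal sketch, using hypothetical variables y, a and b (they need not exist, since terms() only looks at the formula): a * b is shorthand for a + b + a:b.

    # ':' builds an interaction term; '*' expands to the main effects plus their interaction
    attr(terms(y ~ a + b + a:b), "term.labels")
    attr(terms(y ~ a * b), "term.labels")  # both give "a" "b" "a:b"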

Define the following models:

  1. a first-order model with all covariates except SEX;

    url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
    db <- read.delim(url)
    f1 <- Y ~ . - SEX
    terms(f1, data = db)
    Y ~ (AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6) - SEX
    attr(,"variables")
    list(Y, AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6)
    attr(,"factors")
        AGE BMI BP S1 S2 S3 S4 S5 S6
    Y     0   0  0  0  0  0  0  0  0
    AGE   1   0  0  0  0  0  0  0  0
    SEX   0   0  0  0  0  0  0  0  0
    BMI   0   1  0  0  0  0  0  0  0
    BP    0   0  1  0  0  0  0  0  0
    S1    0   0  0  1  0  0  0  0  0
    S2    0   0  0  0  1  0  0  0  0
    S3    0   0  0  0  0  1  0  0  0
    S4    0   0  0  0  0  0  1  0  0
    S5    0   0  0  0  0  0  0  1  0
    S6    0   0  0  0  0  0  0  0  1
    attr(,"term.labels")
    [1] "AGE" "BMI" "BP"  "S1"  "S2"  "S3"  "S4"  "S5"  "S6" 
    attr(,"order")
    [1] 1 1 1 1 1 1 1 1 1
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    <environment: R_GlobalEnv>
  2. a second-order model with the BMI and BP covariates modelling the logarithm of the response variable;

    f2 <- log(Y) ~ BMI * BP
    terms(f2, data = db)
    log(Y) ~ BMI * BP
    attr(,"variables")
    list(log(Y), BMI, BP)
    attr(,"factors")
           BMI BP BMI:BP
    log(Y)   0  0      0
    BMI      1  0      1
    BP       0  1      1
    attr(,"term.labels")
    [1] "BMI"    "BP"     "BMI:BP"
    attr(,"order")
    [1] 1 1 2
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    <environment: R_GlobalEnv>
  3. a model with the covariates S1 to S6 and all their second-order interactions;

    f3 <- Y ~ (S1 + S2 + S3 + S4 + S5 + S6)^2
    terms(f3, data = db)
    Y ~ (S1 + S2 + S3 + S4 + S5 + S6)^2
    attr(,"variables")
    list(Y, S1, S2, S3, S4, S5, S6)
    attr(,"factors")
       S1 S2 S3 S4 S5 S6 S1:S2 S1:S3 S1:S4 S1:S5 S1:S6 S2:S3 S2:S4 S2:S5 S2:S6
    Y   0  0  0  0  0  0     0     0     0     0     0     0     0     0     0
    S1  1  0  0  0  0  0     1     1     1     1     1     0     0     0     0
    S2  0  1  0  0  0  0     1     0     0     0     0     1     1     1     1
    S3  0  0  1  0  0  0     0     1     0     0     0     1     0     0     0
    S4  0  0  0  1  0  0     0     0     1     0     0     0     1     0     0
    S5  0  0  0  0  1  0     0     0     0     1     0     0     0     1     0
    S6  0  0  0  0  0  1     0     0     0     0     1     0     0     0     1
       S3:S4 S3:S5 S3:S6 S4:S5 S4:S6 S5:S6
    Y      0     0     0     0     0     0
    S1     0     0     0     0     0     0
    S2     0     0     0     0     0     0
    S3     1     1     1     0     0     0
    S4     1     0     0     1     1     0
    S5     0     1     0     1     0     1
    S6     0     0     1     0     1     1
    attr(,"term.labels")
     [1] "S1"    "S2"    "S3"    "S4"    "S5"    "S6"    "S1:S2" "S1:S3" "S1:S4"
    [10] "S1:S5" "S1:S6" "S2:S3" "S2:S4" "S2:S5" "S2:S6" "S3:S4" "S3:S5" "S3:S6"
    [19] "S4:S5" "S4:S6" "S5:S6"
    attr(,"order")
     [1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    <environment: R_GlobalEnv>
  4. a model with the covariates BMI, BP and the sum (S1+S2+S3); the role of I() here is illustrated in the sketch right after this list.

    f4 <- Y ~ BMI + BP + I(S1 + S2 + S3)
    terms(f4, data = db)
    Y ~ BMI + BP + I(S1 + S2 + S3)
    attr(,"variables")
    list(Y, BMI, BP, I(S1 + S2 + S3))
    attr(,"factors")
                    BMI BP I(S1 + S2 + S3)
    Y                 0  0               0
    BMI               1  0               0
    BP                0  1               0
    I(S1 + S2 + S3)   0  0               1
    attr(,"term.labels")
    [1] "BMI"             "BP"              "I(S1 + S2 + S3)"
    attr(,"order")
    [1] 1 1 1
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    <environment: R_GlobalEnv>
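
A remark on model 4: inside a formula, + is the term operator, so the arithmetic sum S1+S2+S3 must be wrapped in I() to be treated as a single covariate. A minimal check of the difference:

    # Without I(), the parentheses are ignored and '+' adds three separate terms
    attr(terms(Y ~ BMI + BP + (S1 + S2 + S3), data = db), "term.labels")
    # "BMI" "BP" "S1" "S2" "S3"
    attr(terms(f4, data = db), "term.labels")
    # "BMI" "BP" "I(S1 + S2 + S3)"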

Notes

  1. the design matrix can be obtained with the model.matrix function; a sketch using it to reproduce the least-squares estimates follows these notes.

    dm1 <- model.matrix(f1, data = db)
    # Uncomment the next line if you fancy looking at large matrices (442 x 10).
    # dm1
  2. formulas can also be written as strings.

    f <- paste("Y ~", paste0("S", 1:10, collapse = " + "))
    terms(formula(f))
    Y ~ S1 + S2 + S3 + S4 + S5 + S6 + S7 + S8 + S9 + S10
    attr(,"variables")
    list(Y, S1, S2, S3, S4, S5, S6, S7, S8, S9, S10)
    attr(,"factors")
        S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
    Y    0  0  0  0  0  0  0  0  0   0
    S1   1  0  0  0  0  0  0  0  0   0
    S2   0  1  0  0  0  0  0  0  0   0
    S3   0  0  1  0  0  0  0  0  0   0
    S4   0  0  0  1  0  0  0  0  0   0
    S5   0  0  0  0  1  0  0  0  0   0
    S6   0  0  0  0  0  1  0  0  0   0
    S7   0  0  0  0  0  0  1  0  0   0
    S8   0  0  0  0  0  0  0  1  0   0
    S9   0  0  0  0  0  0  0  0  1   0
    S10  0  0  0  0  0  0  0  0  0   1
    attr(,"term.labels")
     [1] "S1"  "S2"  "S3"  "S4"  "S5"  "S6"  "S7"  "S8"  "S9"  "S10"
    attr(,"order")
     [1] 1 1 1 1 1 1 1 1 1 1
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    <environment: R_GlobalEnv>
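
As a complement to note 1, the design matrix can be used to reproduce the least-squares estimates through the normal equations. This is only a sketch, not part of the lab:

    X <- model.matrix(f1, data = db)                           # 442 x 10 design matrix
    beta_hat <- drop(solve(crossprod(X), crossprod(X, db$Y)))  # solves (X'X) b = X'Y
    cbind(normal_eq = beta_hat, lm_fit = coef(lm(f1, data = db)))
    # the two columns should coincide (up to numerical error)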

1.2 Tests for the regression coefficients

  1. Consider the first-order model with all covariates and test its significance, explaining all the related results in the output.

    fullmodel <- lm(Y ~ ., data = db)
    
    sum_full <- summary(fullmodel)
    sum_full
    
    Call:
    lm(formula = Y ~ ., data = db)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -155.83  -38.54   -0.23   37.81  151.35 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
    AGE           -0.0364     0.2170   -0.17   0.8670    
    SEX          -22.8596     5.8358   -3.92   0.0001 ***
    BMI            5.6030     0.7171    7.81  4.3e-14 ***
    BP             1.1168     0.2252    4.96  1.0e-06 ***
    S1            -1.0900     0.5733   -1.90   0.0579 .  
    S2             0.7465     0.5308    1.41   0.1604    
    S3             0.3720     0.7825    0.48   0.6347    
    S4             6.5338     5.9586    1.10   0.2735    
    S5            68.4831    15.6697    4.37  1.6e-05 ***
    S6             0.2801     0.2733    1.02   0.3060    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 54.2 on 431 degrees of freedom
    Multiple R-squared:  0.518, Adjusted R-squared:  0.507 
    F-statistic: 46.3 on 10 and 431 DF,  p-value: <2e-16

    We are testing the hypothesis \(H_0: \forall i >0: \beta_i=0\) against \(H_1: \exists i >0: \beta_i\neq 0\) using an F-test, that is, a test statistic with an F distribution with \(p-1=10\) and \(n-p=431\) degrees of freedom. The observed value of the test statistic is \(F_o=46.27\), which leads to a p-value of \(P(F>F_o\mid H_0)\approx 0\). This result allows us to reject \(H_0\) and to conclude that these covariates have some influence on the response variable. However, we also have \(R^2=0.5177\), which shows that, as used, the covariates only account for approximately 52% of the observed variation in the response variable. The observed value of the adjusted \(R^2\) shows only a small penalty for using all 10 covariates. (The F statistic and its p-value are reproduced by hand in the sketch after this list.)

    Note: none of these results depend on the order of the terms in the model.

    sum_full
    fullmodel2 <- lm(Y ~ ., data = db[, sample(ncol(db))])
    summary(fullmodel2)
    
    Call:
    lm(formula = Y ~ ., data = db)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -155.83  -38.54   -0.23   37.81  151.35 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
    AGE           -0.0364     0.2170   -0.17   0.8670    
    SEX          -22.8596     5.8358   -3.92   0.0001 ***
    BMI            5.6030     0.7171    7.81  4.3e-14 ***
    BP             1.1168     0.2252    4.96  1.0e-06 ***
    S1            -1.0900     0.5733   -1.90   0.0579 .  
    S2             0.7465     0.5308    1.41   0.1604    
    S3             0.3720     0.7825    0.48   0.6347    
    S4             6.5338     5.9586    1.10   0.2735    
    S5            68.4831    15.6697    4.37  1.6e-05 ***
    S6             0.2801     0.2733    1.02   0.3060    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 54.2 on 431 degrees of freedom
    Multiple R-squared:  0.518, Adjusted R-squared:  0.507 
    F-statistic: 46.3 on 10 and 431 DF,  p-value: <2e-16
    
    Call:
    lm(formula = Y ~ ., data = db[, sample(ncol(db))])
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -155.83  -38.54   -0.23   37.81  151.35 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
    S1            -1.0900     0.5733   -1.90   0.0579 .  
    S2             0.7465     0.5308    1.41   0.1604    
    S6             0.2801     0.2733    1.02   0.3060    
    S4             6.5338     5.9586    1.10   0.2735    
    AGE           -0.0364     0.2170   -0.17   0.8670    
    S3             0.3720     0.7825    0.48   0.6347    
    BP             1.1168     0.2252    4.96  1.0e-06 ***
    S5            68.4831    15.6697    4.37  1.6e-05 ***
    BMI            5.6030     0.7171    7.81  4.3e-14 ***
    SEX          -22.8596     5.8358   -3.92   0.0001 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 54.2 on 431 degrees of freedom
    Multiple R-squared:  0.518, Adjusted R-squared:  0.507 
    F-statistic: 46.3 on 10 and 431 DF,  p-value: <2e-16
  2. Test the removal of the covariate S3 from the model.

    reducedmodel <- update(fullmodel, . ~ . - S3)
    
    anova(reducedmodel, fullmodel)
    Analysis of Variance Table
    
    Model 1: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S4 + S5 + S6
    Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
      Res.Df     RSS Df Sum of Sq    F Pr(>F)
    1    432 1264649                         
    2    431 1263986  1       663 0.23   0.63

    Note that \(t_o^2 = 0.4754^2 = 0.226\), where \(t_o\) is the t statistic for S3 in the output of 1., is equal to \(F_o\), with both tests leading to the same p-value; this equivalence is checked numerically in the sketch after this list.

  3. Test the hypothesis \(H_0:\beta_{BMI}=5\), i.e., that the coefficient of BMI equals 5.

    reducedmodel <- update(fullmodel, . ~ . - BMI + offset(5 * BMI))
    
    anova(reducedmodel, fullmodel)
    Analysis of Variance Table
    
    Model 1: Y ~ AGE + SEX + BP + S1 + S2 + S3 + S4 + S5 + S6 + offset(5 * 
        BMI)
    Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
      Res.Df     RSS Df Sum of Sq    F Pr(>F)
    1    432 1266059                         
    2    431 1263986  1      2073 0.71    0.4
  4. Test the hypothesis that all regression coefficients related to the covariates S1 to S6 are equal.

    reducedmodel <- lm(Y ~ AGE + BMI + BP + I(S1 + S2 + S3 + S4 + S5 + S6), data = db)
    
    anova(reducedmodel, fullmodel)
    Analysis of Variance Table
    
    Model 1: Y ~ AGE + BMI + BP + I(S1 + S2 + S3 + S4 + S5 + S6)
    Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
      Res.Df     RSS Df Sum of Sq    F Pr(>F)    
    1    437 1582504                             
    2    431 1263986  6    318518 18.1 <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
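
As announced in item 1, the overall F statistic and its p-value can be reproduced by hand from the sums of squares (a sketch reusing fullmodel):

    n <- nrow(db)
    p <- length(coef(fullmodel))       # 11 parameters: the intercept plus 10 slopes
    SSE <- sum(resid(fullmodel)^2)
    SST <- sum((db$Y - mean(db$Y))^2)
    F_o <- ((SST - SSE) / (p - 1)) / (SSE / (n - p))
    c(F = F_o, p_value = pf(F_o, p - 1, n - p, lower.tail = FALSE))
    # should match the F-statistic on 10 and 431 DF reported by summary(fullmodel)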
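
Likewise, the equivalences used in items 2 and 3 can be checked directly from the coefficient table of the full model (a sketch reusing sum_full): the squared t statistic for S3 equals the 1-df partial F statistic, and \(H_0:\beta_{BMI}=5\) is an ordinary t-test on the shifted estimate.

    # item 2: t^2 for S3 equals the partial F for dropping S3
    t_S3 <- sum_full$coefficients["S3", "t value"]
    c(t_squared = t_S3^2,
      F_drop_S3 = anova(update(fullmodel, . ~ . - S3), fullmodel)$F[2])

    # item 3: direct t-test of H0: beta_BMI = 5
    t_o <- (sum_full$coefficients["BMI", "Estimate"] - 5) / sum_full$coefficients["BMI", "Std. Error"]
    c(t_squared = t_o^2,
      p_value = 2 * pt(abs(t_o), df = fullmodel$df.residual, lower.tail = FALSE))
    # t_o^2 and the p-value should match the anova comparison with the offset model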

1.3 Sequential F-tests

Starting from the null model, analyse the evolution of \(SSR\) and \(SSE\) as covariates are included in the first-order model.

Along with the summary of the lm fit, another useful analysis tool is the ANOVA table:

anv_full <- anova(fullmodel)
anv_full
Analysis of Variance Table

Response: Y
           Df  Sum Sq Mean Sq F value  Pr(>F)    
AGE         1   92527   92527   31.55 3.5e-08 ***
SEX         1     293     293    0.10    0.75    
BMI         1  826955  826955  281.98 < 2e-16 ***
BP          1  129312  129312   44.09 9.4e-11 ***
S1          1    1791    1791    0.61    0.43    
S2          1    5058    5058    1.72    0.19    
S3          1  237329  237329   80.93 < 2e-16 ***
S4          1    1821    1821    0.62    0.43    
S5          1   58856   58856   20.07 9.6e-06 ***
S6          1    3080    3080    1.05    0.31    
Residuals 431 1263986    2933                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SSE <- tail(anv_full$`Sum Sq`, 1)
SSE
[1] 1263986
SSR <- sum(head(anv_full$`Sum Sq`, -1))
SSR
[1] 1357023
SSR + SSE
[1] 2621009
SST <- var(db$Y) * (nrow(db) - 1)
SST
[1] 2621009
decomp <- data.frame(SSR=c(0, cumsum(head(anv_full$`Sum Sq`, -1))))
decomp$SSE <- SST - decomp$SSR
row.names(decomp) <- c("NULL", head(row.names(anv_full), -1))
decomp$`ΔSSR` <- round(c(0, diff(decomp$SSR)), 0)
decomp
         SSR     SSE   ΔSSR
NULL       0 2621009      0
AGE    92527 2528482  92527
SEX    92821 2528188    293
BMI   919776 1701233 826955
BP   1049088 1571921 129312
S1   1050879 1570130   1791
S2   1055937 1565073   5058
S3   1293266 1327743 237329
S4   1295087 1325922   1821
S5   1353943 1267066  58856
S6   1357023 1263986   3080
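
Each sequential F statistic in the ANOVA table above is simply the corresponding increment ΔSSR divided by the full-model mean squared error, so the decomposition and the table carry the same information (a sketch reusing anv_full):

    MSE <- tail(anv_full$`Mean Sq`, 1)     # full-model MSE (about 2933)
    cbind(delta_SSR_over_MSE = head(anv_full$`Sum Sq`, -1) / MSE,
          F_reported = head(anv_full$`F value`, -1))
    # the two columns should coincide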

Note: unlike the results from the model fit, these sequential F-tests all depend on the order in which terms are included in the model.

anv_full
anova(fullmodel2)
Analysis of Variance Table

Response: Y
           Df  Sum Sq Mean Sq F value  Pr(>F)    
AGE         1   92527   92527   31.55 3.5e-08 ***
SEX         1     293     293    0.10    0.75    
BMI         1  826955  826955  281.98 < 2e-16 ***
BP          1  129312  129312   44.09 9.4e-11 ***
S1          1    1791    1791    0.61    0.43    
S2          1    5058    5058    1.72    0.19    
S3          1  237329  237329   80.93 < 2e-16 ***
S4          1    1821    1821    0.62    0.43    
S5          1   58856   58856   20.07 9.6e-06 ***
S6          1    3080    3080    1.05    0.31    
Residuals 431 1263986    2933                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table

Response: Y
           Df  Sum Sq Mean Sq F value  Pr(>F)    
S1          1  117824  117824   40.18 5.9e-10 ***
S2          1    3449    3449    1.18  0.2788    
S6          1  287816  287816   98.14 < 2e-16 ***
S4          1  323072  323072  110.16 < 2e-16 ***
AGE         1    6672    6672    2.28  0.1322    
S3          1  131058  131058   44.69 7.2e-11 ***
BP          1  138048  138048   47.07 2.4e-11 ***
S5          1   94223   94223   32.13 2.6e-08 ***
BMI         1  209863  209863   71.56 4.2e-16 ***
SEX         1   44999   44999   15.34  0.0001 ***
Residuals 431 1263986    2933                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1