Linear Models
Lab#2 – Multiple linear regression model
1 The diabetes dataset again . . .
We will keep using the dataset introduced in Lab#1 where we had this brief description:
Ten baseline variables, AGE (in years), SEX, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6) were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (Y). The data is available in the file diabetes.txt.
1.1 The specification of linear models
In R, models are specified in a compact symbolic form. From the documentation:
The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
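These operators can be explored directly with terms(). The following is a minimal sketch on a small synthetic data frame (the names y, x1 and x2 are placeholders, not variables from the diabetes data):

```r
# Synthetic data just to give the formulas a context; the values are
# irrelevant, only the column names matter for term expansion.
d <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

# '*' expands to the main effects plus their interaction (x1 + x2 + x1:x2):
labels(terms(y ~ x1 * x2, data = d))

# '.' stands for all remaining columns and '-' removes a term:
labels(terms(y ~ . - x2, data = d))
```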
Define the following models:
- a first-order model with all covariates except SEX;

url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt"
db <- read.delim(url)
f1 <- Y ~ . - SEX
terms(f1, data = db)

Y ~ (AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6) - SEX
attr(,"variables")
list(Y, AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6)
attr(,"factors")
    AGE BMI BP S1 S2 S3 S4 S5 S6
Y     0   0  0  0  0  0  0  0  0
AGE   1   0  0  0  0  0  0  0  0
SEX   0   0  0  0  0  0  0  0  0
BMI   0   1  0  0  0  0  0  0  0
BP    0   0  1  0  0  0  0  0  0
S1    0   0  0  1  0  0  0  0  0
S2    0   0  0  0  1  0  0  0  0
S3    0   0  0  0  0  1  0  0  0
S4    0   0  0  0  0  0  1  0  0
S5    0   0  0  0  0  0  0  1  0
S6    0   0  0  0  0  0  0  0  1
attr(,"term.labels")
[1] "AGE" "BMI" "BP"  "S1"  "S2"  "S3"  "S4"  "S5"  "S6"
attr(,"order")
[1] 1 1 1 1 1 1 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>

- a second-order model with the BMI and BP covariates, modelling the logarithm of the response variable;

f2 <- log(Y) ~ BMI * BP
terms(f2, data = db)

log(Y) ~ BMI * BP
attr(,"variables")
list(log(Y), BMI, BP)
attr(,"factors")
       BMI BP BMI:BP
log(Y)   0  0      0
BMI      1  0      1
BP       0  1      1
attr(,"term.labels")
[1] "BMI"    "BP"     "BMI:BP"
attr(,"order")
[1] 1 1 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>

- a model with the covariates S1 to S6 and all their second-order interactions;

f3 <- Y ~ (S1 + S2 + S3 + S4 + S5 + S6)^2
terms(f3, data = db)

Y ~ (S1 + S2 + S3 + S4 + S5 + S6)^2
attr(,"variables")
list(Y, S1, S2, S3, S4, S5, S6)
attr(,"factors")
   S1 S2 S3 S4 S5 S6 S1:S2 S1:S3 S1:S4 S1:S5 S1:S6 S2:S3 S2:S4 S2:S5 S2:S6
Y   0  0  0  0  0  0     0     0     0     0     0     0     0     0     0
S1  1  0  0  0  0  0     1     1     1     1     1     0     0     0     0
S2  0  1  0  0  0  0     1     0     0     0     0     1     1     1     1
S3  0  0  1  0  0  0     0     1     0     0     0     1     0     0     0
S4  0  0  0  1  0  0     0     0     1     0     0     0     1     0     0
S5  0  0  0  0  1  0     0     0     0     1     0     0     0     1     0
S6  0  0  0  0  0  1     0     0     0     0     1     0     0     0     1
   S3:S4 S3:S5 S3:S6 S4:S5 S4:S6 S5:S6
Y      0     0     0     0     0     0
S1     0     0     0     0     0     0
S2     0     0     0     0     0     0
S3     1     1     1     0     0     0
S4     1     0     0     1     1     0
S5     0     1     0     1     0     1
S6     0     0     1     0     1     1
attr(,"term.labels")
 [1] "S1"    "S2"    "S3"    "S4"    "S5"    "S6"    "S1:S2" "S1:S3" "S1:S4"
[10] "S1:S5" "S1:S6" "S2:S3" "S2:S4" "S2:S5" "S2:S6" "S3:S4" "S3:S5" "S3:S6"
[19] "S4:S5" "S4:S6" "S5:S6"
attr(,"order")
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>

- a model with the covariates BMI, BP and (S1 + S2 + S3).

f4 <- Y ~ BMI + BP + I(S1 + S2 + S3)
terms(f4, data = db)

Y ~ BMI + BP + I(S1 + S2 + S3)
attr(,"variables")
list(Y, BMI, BP, I(S1 + S2 + S3))
attr(,"factors")
                BMI BP I(S1 + S2 + S3)
Y                 0  0               0
BMI               1  0               0
BP                0  1               0
I(S1 + S2 + S3)   0  0               1
attr(,"term.labels")
[1] "BMI"             "BP"              "I(S1 + S2 + S3)"
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
Notes
- the design matrix can be obtained with the model.matrix function:

dm1 <- model.matrix(f1, data = db)
# Uncomment the next line if you fancy looking at large matrices (442 x 10).
# dm1

- formulas can also be written as strings:

f <- paste("Y ~", paste0("S", 1:10, collapse = " + "))
terms(formula(f))

Y ~ S1 + S2 + S3 + S4 + S5 + S6 + S7 + S8 + S9 + S10
attr(,"variables")
list(Y, S1, S2, S3, S4, S5, S6, S7, S8, S9, S10)
attr(,"factors")
    S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
Y    0  0  0  0  0  0  0  0  0   0
S1   1  0  0  0  0  0  0  0  0   0
S2   0  1  0  0  0  0  0  0  0   0
S3   0  0  1  0  0  0  0  0  0   0
S4   0  0  0  1  0  0  0  0  0   0
S5   0  0  0  0  1  0  0  0  0   0
S6   0  0  0  0  0  1  0  0  0   0
S7   0  0  0  0  0  0  1  0  0   0
S8   0  0  0  0  0  0  0  1  0   0
S9   0  0  0  0  0  0  0  0  1   0
S10  0  0  0  0  0  0  0  0  0   1
attr(,"term.labels")
 [1] "S1"  "S2"  "S3"  "S4"  "S5"  "S6"  "S7"  "S8"  "S9"  "S10"
attr(,"order")
 [1] 1 1 1 1 1 1 1 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
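As a cleaner alternative to pasting strings together, base R's reformulate() builds a formula directly from a character vector of term labels. A short sketch, producing the same kind of formula as the paste() approach above (here with only S1 to S6):

```r
# Build Y ~ S1 + S2 + ... + S6 from character vectors of names.
f <- reformulate(paste0("S", 1:6), response = "Y")
f
```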
1.2 Tests for the regression coefficients
Consider the first-order model with all covariates and test its significance, explaining all the related results in the output.

fullmodel <- lm(Y ~ ., data = db)
sum_full <- summary(fullmodel)
sum_full

Call:
lm(formula = Y ~ ., data = db)

Residuals:
    Min      1Q  Median      3Q     Max
-155.83  -38.54   -0.23   37.81  151.35

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
AGE           -0.0364     0.2170   -0.17   0.8670
SEX          -22.8596     5.8358   -3.92   0.0001 ***
BMI            5.6030     0.7171    7.81  4.3e-14 ***
BP             1.1168     0.2252    4.96  1.0e-06 ***
S1            -1.0900     0.5733   -1.90   0.0579 .
S2             0.7465     0.5308    1.41   0.1604
S3             0.3720     0.7825    0.48   0.6347
S4             6.5338     5.9586    1.10   0.2735
S5            68.4831    15.6697    4.37  1.6e-05 ***
S6             0.2801     0.2733    1.02   0.3060
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.2 on 431 degrees of freedom
Multiple R-squared: 0.518, Adjusted R-squared: 0.507
F-statistic: 46.3 on 10 and 431 DF, p-value: <2e-16

We are testing the hypothesis \(H_0: \forall i >0: \beta_i=0\) against \(H_1: \exists i >0: \beta_i\neq 0\) using an F-test, that is, a test statistic with an F distribution with \(p-1=10\) and \(n-p=431\) degrees of freedom. The observed value of the test statistic is \(F_o=46.27\), which leads to a p-value of \(P(F>F_o\mid H_0)\approx 0\). This result allows us to reject \(H_0\) and to conclude that these covariates have some influence on the response variable. However, we also have \(R^2=0.5177\), which shows that, as used, the covariates only account for approximately 52% of the observed variation in the response variable. The observed value of the adjusted \(R^2\) shows only a small penalty for using all 10 covariates.
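To see where the F statistic comes from, it can be recomputed by hand from the sums of squares. A minimal sketch using the built-in mtcars data rather than the diabetes data (the variables mpg, wt and hp belong to mtcars):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))                          # number of parameters, intercept included
SSE <- sum(resid(fit)^2)                        # residual sum of squares
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
SSR <- SST - SSE                                # regression sum of squares
F_o <- (SSR / (p - 1)) / (SSE / (n - p))
# Matches the F-statistic reported by summary():
all.equal(unname(summary(fit)$fstatistic["value"]), F_o)
```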
Note: none of these results depend on the order of the terms in the model.
sum_full
fullmodel2 <- lm(Y ~ ., data = db[, sample(ncol(db))])
summary(fullmodel2)

Call:
lm(formula = Y ~ ., data = db)

Residuals:
    Min      1Q  Median      3Q     Max
-155.83  -38.54   -0.23   37.81  151.35

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
AGE           -0.0364     0.2170   -0.17   0.8670
SEX          -22.8596     5.8358   -3.92   0.0001 ***
BMI            5.6030     0.7171    7.81  4.3e-14 ***
BP             1.1168     0.2252    4.96  1.0e-06 ***
S1            -1.0900     0.5733   -1.90   0.0579 .
S2             0.7465     0.5308    1.41   0.1604
S3             0.3720     0.7825    0.48   0.6347
S4             6.5338     5.9586    1.10   0.2735
S5            68.4831    15.6697    4.37  1.6e-05 ***
S6             0.2801     0.2733    1.02   0.3060
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.2 on 431 degrees of freedom
Multiple R-squared: 0.518, Adjusted R-squared: 0.507
F-statistic: 46.3 on 10 and 431 DF, p-value: <2e-16

Call:
lm(formula = Y ~ ., data = db[, sample(ncol(db))])

Residuals:
    Min      1Q  Median      3Q     Max
-155.83  -38.54   -0.23   37.81  151.35

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -334.5671    67.4546   -4.96  1.0e-06 ***
S1            -1.0900     0.5733   -1.90   0.0579 .
S2             0.7465     0.5308    1.41   0.1604
S6             0.2801     0.2733    1.02   0.3060
S4             6.5338     5.9586    1.10   0.2735
AGE           -0.0364     0.2170   -0.17   0.8670
S3             0.3720     0.7825    0.48   0.6347
BP             1.1168     0.2252    4.96  1.0e-06 ***
S5            68.4831    15.6697    4.37  1.6e-05 ***
BMI            5.6030     0.7171    7.81  4.3e-14 ***
SEX          -22.8596     5.8358   -3.92   0.0001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.2 on 431 degrees of freedom
Multiple R-squared: 0.518, Adjusted R-squared: 0.507
F-statistic: 46.3 on 10 and 431 DF, p-value: <2e-16

Test the removal of the covariate S3 from the model.

reducedmodel <- update(fullmodel, . ~ . - S3)
anova(reducedmodel, fullmodel)

Analysis of Variance Table

Model 1: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S4 + S5 + S6
Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
  Res.Df     RSS Df Sum of Sq    F Pr(>F)
1    432 1264649
2    431 1263986  1       663 0.23   0.63

Note that \(t_o^2 = 0.4754^2 = 0.226\), from the output in a., is equal to \(F_o\), with both tests leading to the same p-value.
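The same \(t_o^2 = F_o\) identity can be checked on any single-coefficient drop. A sketch on the built-in mtcars data (not the diabetes data):

```r
full <- lm(mpg ~ wt + hp, data = mtcars)
red  <- update(full, . ~ . - hp)                    # drop one term
t_o  <- summary(full)$coefficients["hp", "t value"] # partial t statistic
F_o  <- anova(red, full)$F[2]                       # 1-df partial F statistic
all.equal(t_o^2, F_o)                               # the two tests agree
```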
Test the hypothesis \(H_0:\beta_2=5\), where \(\beta_2\) denotes the coefficient of BMI.
reducedmodel <- update(fullmodel, . ~ . - BMI + offset(5 * BMI))
anova(reducedmodel, fullmodel)

Analysis of Variance Table

Model 1: Y ~ AGE + SEX + BP + S1 + S2 + S3 + S4 + S5 + S6 + offset(5 * BMI)
Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
  Res.Df     RSS Df Sum of Sq    F Pr(>F)
1    432 1266059
2    431 1263986  1      2073 0.71    0.4

Test the hypothesis that all regression coefficients related to the covariates S1 to S6 are equal.

reducedmodel <- lm(Y ~ AGE + BMI + BP + I(S1 + S2 + S3 + S4 + S5 + S6), data = db)
# Note: SEX is also absent from this reduced model, so the comparison below
# additionally imposes beta_SEX = 0 (hence 6 constraints rather than 5).
anova(reducedmodel, fullmodel)

Analysis of Variance Table

Model 1: Y ~ AGE + BMI + BP + I(S1 + S2 + S3 + S4 + S5 + S6)
Model 2: Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6
  Res.Df     RSS Df Sum of Sq    F Pr(>F)
1    437 1582504
2    431 1263986  6    318518 18.1 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
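The I() device generalises: a hypothesis that two coefficients are equal can be tested by replacing the two covariates with their sum. A sketch on the built-in mtcars data (the choice of covariates is purely illustrative):

```r
full <- lm(mpg ~ wt + cyl + gear, data = mtcars)
# Under H0: beta_cyl = beta_gear, the model only needs the sum cyl + gear.
red  <- lm(mpg ~ wt + I(cyl + gear), data = mtcars)
anova(red, full)   # one-degree-of-freedom F-test of the equality constraint
```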
1.3 Sequential F-tests
Starting from the null model, analyse the evolution of \(SSR\) and \(SSE\) as covariates are included in the first-order model.
Along with the summary output of the lm function, another analysis tool is the ANOVA table:
anv_full <- anova(fullmodel)
anv_full

Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
AGE 1 92527 92527 31.55 3.5e-08 ***
SEX 1 293 293 0.10 0.75
BMI 1 826955 826955 281.98 < 2e-16 ***
BP 1 129312 129312 44.09 9.4e-11 ***
S1 1 1791 1791 0.61 0.43
S2 1 5058 5058 1.72 0.19
S3 1 237329 237329 80.93 < 2e-16 ***
S4 1 1821 1821 0.62 0.43
S5 1 58856 58856 20.07 9.6e-06 ***
S6 1 3080 3080 1.05 0.31
Residuals 431 1263986 2933
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SSE <- tail(anv_full$`Sum Sq`, 1)
SSE
[1] 1263986
SSR <- sum(head(anv_full$`Sum Sq`, -1))
SSR
[1] 1357023
SSR + SSE
[1] 2621009
SST <- var(db$Y) * (nrow(db) - 1)
SST
[1] 2621009
decomp <- data.frame(SSR=c(0, cumsum(head(anv_full$`Sum Sq`, -1))))
decomp$SSE <- SST - decomp$SSR
row.names(decomp) <- c("NULL", head(row.names(anv_full), -1))
decomp$`ΔSSR` <- round(c(0, diff(decomp$SSR)), 0)
decomp
         SSR     SSE   ΔSSR
NULL 0 2621009 0
AGE 92527 2528482 92527
SEX 92821 2528188 293
BMI 919776 1701233 826955
BP 1049088 1571921 129312
S1 1050879 1570130 1791
S2 1055937 1565073 5058
S3 1293266 1327743 237329
S4 1295087 1325922 1821
S5 1353943 1267066 58856
S6 1357023 1263986 3080
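The sequential Sum Sq column of anova() is exactly the drop in RSS as each term enters the model, which can be verified by refitting one term at a time. A sketch on the built-in mtcars data (not the diabetes data):

```r
m0 <- lm(mpg ~ 1, data = mtcars)   # null model: RSS equals SST
m1 <- update(m0, . ~ . + wt)
m2 <- update(m1, . ~ . + hp)
anv <- anova(m2)
# deviance() on an lm object returns its RSS; successive differences
# reproduce the sequential sums of squares in the ANOVA table.
drops <- c(deviance(m0) - deviance(m1),
           deviance(m1) - deviance(m2))
all.equal(head(anv$`Sum Sq`, -1), drops)
```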
Note: unlike the results from the model fit, all of these sequential F-tests depend on the order in which terms are included in the model.
anv_full
anova(fullmodel2)

Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
AGE 1 92527 92527 31.55 3.5e-08 ***
SEX 1 293 293 0.10 0.75
BMI 1 826955 826955 281.98 < 2e-16 ***
BP 1 129312 129312 44.09 9.4e-11 ***
S1 1 1791 1791 0.61 0.43
S2 1 5058 5058 1.72 0.19
S3 1 237329 237329 80.93 < 2e-16 ***
S4 1 1821 1821 0.62 0.43
S5 1 58856 58856 20.07 9.6e-06 ***
S6 1 3080 3080 1.05 0.31
Residuals 431 1263986 2933
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
S1 1 117824 117824 40.18 5.9e-10 ***
S2 1 3449 3449 1.18 0.2788
S6 1 287816 287816 98.14 < 2e-16 ***
S4 1 323072 323072 110.16 < 2e-16 ***
AGE 1 6672 6672 2.28 0.1322
S3 1 131058 131058 44.69 7.2e-11 ***
BP 1 138048 138048 47.07 2.4e-11 ***
S5 1 94223 94223 32.13 2.6e-08 ***
BMI 1 209863 209863 71.56 4.2e-16 ***
SEX 1 44999 44999 15.34 0.0001 ***
Residuals 431 1263986 2933
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
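The order dependence can be demonstrated compactly: refitting with the terms swapped changes each term's sequential sum of squares but not the residual line. A sketch on the built-in mtcars data:

```r
a1 <- anova(lm(mpg ~ wt + hp, data = mtcars))
a2 <- anova(lm(mpg ~ hp + wt, data = mtcars))
# The sequential (Type I) SS for 'wt' differs between the two orders
# because wt and hp are correlated ...
a1["wt", "Sum Sq"] == a2["wt", "Sum Sq"]
# ... while the residual sum of squares is order-invariant.
all.equal(a1["Residuals", "Sum Sq"], a2["Residuals", "Sum Sq"])
```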