Linear Models
Lab#1 – Simple linear regression model
1 Diabetes dataset
Ten baseline variables, AGE (in years), SEX, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (S1 to S6) were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (Y). The data is available in the file diabetes.txt (18.1 KB) and a brief description of the variables is here.
1.1 Data reading and exploratory analysis
Read the data, inspect it and repair anything that may need to be fixed or that can be improved.
url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/diabetes.txt" db <- read.delim(url) str(db)'data.frame': 442 obs. of 11 variables: $ AGE: int 59 48 72 24 50 23 36 66 60 29 ... $ SEX: int 2 1 2 1 1 1 2 2 2 1 ... $ BMI: num 32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ... $ BP : num 101 87 93 84 101 89 90 114 83 85 ... $ S1 : int 157 183 156 198 192 139 160 255 179 180 ... $ S2 : num 93.2 103.2 93.6 131.4 125.4 ... $ S3 : num 38 70 41 40 52 61 50 56 42 43 ... $ S4 : num 4 3 4 5 4 2 3 4.55 4 4 ... $ S5 : num 4.86 3.89 4.67 4.89 4.29 ... $ S6 : int 87 69 85 89 80 68 82 92 94 88 ... $ Y : int 151 75 141 206 135 97 138 63 110 310 ...summary(db)AGE SEX BMI BP S1 Min. :19.0 Min. :1.00 Min. :18.0 Min. : 62.0 Min. : 97 1st Qu.:38.2 1st Qu.:1.00 1st Qu.:23.2 1st Qu.: 84.0 1st Qu.:164 Median :50.0 Median :1.00 Median :25.7 Median : 93.0 Median :186 Mean :48.5 Mean :1.47 Mean :26.4 Mean : 94.6 Mean :189 3rd Qu.:59.0 3rd Qu.:2.00 3rd Qu.:29.3 3rd Qu.:105.0 3rd Qu.:210 Max. :79.0 Max. :2.00 Max. :42.2 Max. :133.0 Max. :301 S2 S3 S4 S5 S6 Min. : 41.6 Min. :22.0 Min. :2.00 Min. :3.26 Min. : 58.0 1st Qu.: 96.0 1st Qu.:40.2 1st Qu.:3.00 1st Qu.:4.28 1st Qu.: 83.2 Median :113.0 Median :48.0 Median :4.00 Median :4.62 Median : 91.0 Mean :115.4 Mean :49.8 Mean :4.07 Mean :4.64 Mean : 91.3 3rd Qu.:134.5 3rd Qu.:57.8 3rd Qu.:5.00 3rd Qu.:5.00 3rd Qu.: 98.0 Max. :242.4 Max. :99.0 Max. :9.09 Max. :6.11 Max. :124.0 Y Min. : 25 1st Qu.: 87 Median :140 Mean :152 3rd Qu.:212 Max. :346db$SEX <- as.factor(db$SEX) str(db)'data.frame': 442 obs. of 11 variables: $ AGE: int 59 48 72 24 50 23 36 66 60 29 ... $ SEX: Factor w/ 2 levels "1","2": 2 1 2 1 1 1 2 2 2 1 ... $ BMI: num 32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ... $ BP : num 101 87 93 84 101 89 90 114 83 85 ... $ S1 : int 157 183 156 198 192 139 160 255 179 180 ... $ S2 : num 93.2 103.2 93.6 131.4 125.4 ... $ S3 : num 38 70 41 40 52 61 50 56 42 43 ... $ S4 : num 4 3 4 5 4 2 3 4.55 4 4 ... $ S5 : num 4.86 3.89 4.67 4.89 4.29 ... $ S6 : int 87 69 85 89 80 68 82 92 94 88 ... $ Y : int 151 75 141 206 135 97 138 63 110 310 ...For now we can only consider a single covariate. Try to convince yourself that
BMIis a good starting candidate producing, for instance, all the scatter plots between the response variable and each covariate and by analyzing the linear correlation between all variables with both numerical measures and graphical representations.library(ggplot2) library(dplyr) db |> tidyr::pivot_longer(-c(Y, SEX), names_to = "Covariate", values_to = "Value") |> ggplot() + geom_point(aes(Value, Y, color = SEX), size = 1/3) + facet_wrap(~ Covariate, scales = "free_x") + theme(legend.position = "top")db |> select(-SEX) |> cor()AGE BMI BP S1 S2 S3 S4 S5 S6 AGE 1.00000 0.1851 0.3354 0.26006 0.2192 -0.07518 0.2038 0.2708 0.3017 BMI 0.18508 1.0000 0.3954 0.24978 0.2612 -0.36681 0.4138 0.4462 0.3887 BP 0.33543 0.3954 1.0000 0.24246 0.1855 -0.17876 0.2577 0.3935 0.3904 S1 0.26006 0.2498 0.2425 1.00000 0.8967 0.05152 0.5422 0.5155 0.3257 S2 0.21924 0.2612 0.1855 0.89666 1.0000 -0.19646 0.6598 0.3184 0.2906 S3 -0.07518 -0.3668 -0.1788 0.05152 -0.1965 1.00000 -0.7385 -0.3986 -0.2737 S4 0.20384 0.4138 0.2577 0.54221 0.6598 -0.73849 1.0000 0.6179 0.4172 S5 0.27077 0.4462 0.3935 0.51550 0.3184 -0.39858 0.6179 1.0000 0.4647 S6 0.30173 0.3887 0.3904 0.32572 0.2906 -0.27370 0.4172 0.4647 1.0000 Y 0.18789 0.5865 0.4415 0.21202 0.1741 -0.39479 0.4305 0.5659 0.3825 Y AGE 0.1879 BMI 0.5865 BP 0.4415 S1 0.2120 S2 0.1741 S3 -0.3948 S4 0.4305 S5 0.5659 S6 0.3825 Y 1.0000
1.2 Linear regression in
The
lmfunction is used to fit linear models, including multivariate ones. Use it to fit a SLRM with the covariate referred before and explore the returned object.model <- lm(Y ~ BMI, data = db) modelCall: lm(formula = Y ~ BMI, data = db) Coefficients: (Intercept) BMI -117.8 10.2ms <- summary(model) msCall: lm(formula = Y ~ BMI, data = db) Residuals: Min 1Q Median 3Q Max -164.92 -43.57 -8.65 46.34 154.88 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -117.773 18.019 -6.54 1.8e-10 *** BMI 10.233 0.674 15.19 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 62.5 on 440 degrees of freedom Multiple R-squared: 0.344, Adjusted R-squared: 0.342 F-statistic: 231 on 1 and 440 DF, p-value: <2e-16ms$coefficients[2, 1][1] 10.23ms$sigma[1] 62.52ms$r.squared[1] 0.3439Compute 96% confidence intervals for the regression coefficients and for the mean value of the response variable for people with
BMIequal to 25 and 40.confint(model, level = 0.96)2 % 98 % (Intercept) -154.890 -80.66 BMI 8.845 11.62new_data <- data.frame(BMI = c(25, 40)) predict(model, new_data, level = 0.96, interval = "confidence")fit lwr upr 1 138.1 131.6 144.5 2 291.6 271.7 311.4g + geom_smooth(method = lm, level = 0.96)predict(model, new_data, level = 0.96, interval = "prediction")fit lwr upr 1 138.1 9.123 267.0 2 291.6 161.255 421.8gist <- "https://gist.github.com/andrewbaxter439/b508a60786f8af3c0be7b381a667ae07" devtools::source_gist(gist, quiet = TRUE) g + stat_predict(method = "lm", geom = "ribbon", fill = "hotpink", colour = NA, level = 0.96, alpha = 0.075) + geom_smooth(method = lm, level = 0.96)Notes
The last graphic uses external code that extends
ggplot2capabilities with thestat_predictfunction.Both bands can be misleading regarding the global confidence level. Corrections are required.
References
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) Least Angle Regression, Annals of Statistics (with discussion), 407-499.