Linear Models
Lab#7 – Logistic regression
1 Wells in Bangladesh
Many of the drinking water wells used in Bangladesh and other South Asian countries are contaminated with naturally occurring arsenic (\(\ce{As2O3}\)), a problem estimated to affect around 100 million people. Arsenic is a cumulative poison: the risk of cancer and other diseases grows with accumulated exposure.
Any settlement can have wells with varying levels of arsenic. Unfortunately, even if a neighboring well is safe, that doesn’t mean your well is too. Conversely, if your well has a high level of arsenic, it’s possible you could find a safe well nearby – provided you’re willing to travel the necessary distance and your neighbor agrees to share their well.
A research team measured arsenic levels in all wells in a small region of Bangladesh and classified each as either “safe” (less than 50 μg/L) or “unsafe” (at least 50 μg/L). Households that drew water from unsafe wells were encouraged to switch to safe, neighboring, or community wells, or to construct new wells. Several years later, the researchers returned to the region to find out who had switched wells.
The data file is available at:
https://web.tecnico.ulisboa.pt/paulo.soares/aml/dados/wells.csv (82.3KB).
The five columns in this file contain the values for the following variables:
switch: indicates whether or not the household switched wells (response variable)
arsenic: arsenic level of the household's well (in units of 100 μg/L)
dist: distance to the nearest safe well (m)
assoc: indicates whether any member of the household actively participates in community associations
educ: educational level of the household representative (years of schooling)
The aim of this lab is to fit candidate logistic regression models relating the binary response to the other information collected about each household.
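As a reminder, a logistic regression links the probability of the response through the logit function; with the four covariates above, the first-order model of Section 1.2 is

\[
\log\frac{p_i}{1-p_i} \;=\; \beta_0 + \beta_1\,\mathrm{arsenic}_i + \beta_2\,\mathrm{dist}_i + \beta_3\,\mathrm{assoc}_i + \beta_4\,\mathrm{educ}_i,
\qquad p_i = P(\mathrm{switch}_i = 1).
\]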
1.1 Exploratory analysis
Read and inspect the data file and convert the variable dist to hectometers. Analyze the associations between the covariates and the response variable.
library(tidyverse)
theme_set(theme_light())
url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/dados/wells.csv"
wells <- read.csv(url)
str(wells)

'data.frame': 3020 obs. of  5 variables:
 $ switch : int  1 1 0 1 1 1 1 1 1 1 ...
 $ arsenic: num  2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
 $ dist   : num  16.8 47.3 21 21.5 40.9 ...
 $ assoc  : int  0 0 0 0 1 1 1 0 1 1 ...
 $ educ   : int  0 0 10 12 14 9 4 10 0 0 ...
# encode the categorical variables as factors and rescale dist to hectometers
wells$assoc <- as.factor(wells$assoc)
wells$switch <- as.factor(wells$switch)
wells$dist <- wells$dist / 100
summary(wells)

 switch     arsenic          dist          assoc        educ      
 0:1283   Min.   :0.51   Min.   :0.00387   0:1743   Min.   : 0.00  
 1:1737   1st Qu.:0.82   1st Qu.:0.21117   1:1277   1st Qu.: 0.00  
          Median :1.30   Median :0.36762            Median : 5.00  
          Mean   :1.66   Mean   :0.48332            Mean   : 4.83  
          3rd Qu.:2.20   3rd Qu.:0.64041            3rd Qu.: 8.00  
          Max.   :9.65   Max.   :3.39531            Max.   :17.00  
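Note that arsenic starts at 0.51, i.e. 51 μg/L: every household in the study initially drew from an unsafe well (at least 50 μg/L), as expected from the study design. After the conversion, dist ranges up to about 3.4 hm.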
ggplot(wells) +
  geom_boxplot(aes(arsenic, y = switch, color = switch)) +
  theme(legend.position = "none")

ggplot(wells) +
  geom_boxplot(aes(dist, y = switch, color = switch)) +
  theme(legend.position = "none")

tab <- with(wells, table(switch, assoc))
prop.table(tab, margin = 2)

      assoc
switch      0      1
     0 0.4096 0.4456
     1 0.5904 0.5544
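The conditional proportions show similar switching rates for households with (0.5544) and without (0.5904) a member in a community association, suggesting that assoc is at best weakly associated with the response.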
1.2 Logistic regression
Start by fitting the first-order logistic regression model with all covariates, and then consider a second-order model with all interactions.
model1 <- glm(switch ~ ., family = "binomial", data = wells)
summary(model1)

Call:
glm(formula = switch ~ ., family = "binomial", data = wells)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.15671    0.09960   -1.57     0.12    
arsenic      0.46702    0.04160   11.23  < 2e-16 ***
dist        -0.89611    0.10458   -8.57  < 2e-16 ***
assoc1      -0.12430    0.07697   -1.61     0.11    
educ         0.04245    0.00959    4.43  9.5e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3907.8  on 3015  degrees of freedom
AIC: 3918

Number of Fisher Scoring iterations: 4

par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))

Note: for this particular family of GLMs the residuals behave oddly, and the default diagnostic plots are less useful than the residuals of a linear model.
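An informal diagnostic that works better for a binary response is a binned residual plot: the response-scale residuals are averaged within bins of fitted probability, and for an adequate model most bin means should fall inside a band of roughly ±2 standard errors around zero. A minimal base-R sketch follows; the choice of 40 bins is arbitrary, and the quantile-based breaks assume the fitted probabilities have no heavy ties.

# binned residual plot for model1 (40 equal-count bins)
fitted_p <- fitted(model1)                          # estimated P(switch = 1)
res <- residuals(model1, type = "response")         # observed - fitted
bins <- cut(fitted_p,
            breaks = quantile(fitted_p, probs = seq(0, 1, length.out = 41)),
            include.lowest = TRUE)
bin_x <- tapply(fitted_p, bins, mean)               # mean fitted value per bin
bin_y <- tapply(res, bins, mean)                    # mean residual per bin
bin_se <- tapply(res, bins, function(r) 2 * sd(r) / sqrt(length(r)))
plot(bin_x, bin_y, pch = 19,
     xlab = "mean fitted probability", ylab = "mean residual")
lines(bin_x, bin_se, lty = 2)                       # +/- 2*SE reference band
lines(bin_x, -bin_se, lty = 2)
abline(h = 0, col = "grey")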
model2 <- glm(switch ~ .^2, family = "binomial", data = wells)
summary(model2)

Call:
glm(formula = switch ~ .^2, family = "binomial", data = wells)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -0.0577     0.1765   -0.33   0.7435    
arsenic          0.5337     0.0945    5.65  1.6e-08 ***
dist            -1.2161     0.2793   -4.35  1.3e-05 ***
assoc1           0.1389     0.1892    0.73   0.4629    
educ            -0.0132     0.0214   -0.62   0.5365    
arsenic:dist    -0.1100     0.1032   -1.07   0.2862    
arsenic:assoc1  -0.1626     0.0842   -1.93   0.0536 .  
arsenic:educ     0.0174     0.0110    1.58   0.1132    
dist:assoc1      0.2188     0.2148    1.02   0.3083    
dist:educ        0.0838     0.0268    3.13   0.0018 ** 
assoc1:educ     -0.0276     0.0198   -1.39   0.1641    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3883.4  on 3009  degrees of freedom
AIC: 3905

Number of Fisher Scoring iterations: 4

Based on the two fitted models, look for a reduced model that appears adequate. Analyze the variation in goodness-of-fit measures of the successive models and compare the models with the highest and lowest degrees of freedom using an appropriate hypothesis test.
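Before reducing the model by hand, one option is to screen the second-order terms with drop1(), which reports a likelihood-ratio test for each term that can be dropped (output omitted here):

# LRT for dropping each interaction from model2; terms with large
# p-values are candidates for removal
drop1(model2, test = "LRT")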
model3 <- glm(switch ~ . - assoc + dist:educ, family = "binomial", data = wells)
summary(model3)

Call:
glm(formula = switch ~ . - assoc + dist:educ, family = "binomial", data = wells)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.000496   0.109615    0.00   0.9964    
arsenic      0.480599   0.041987   11.45  < 2e-16 ***
dist        -1.389852   0.171884   -8.09  6.2e-16 ***
educ        -0.002077   0.015255   -0.14   0.8917    
dist:educ    0.095636   0.025680    3.72   0.0002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3896.2  on 3015  degrees of freedom
AIC: 3906

Number of Fisher Scoring iterations: 4

model4 <- glm(switch ~ arsenic + dist + dist:educ, family = "binomial", data = wells)
summ4 <- summary(model4)
summ4

Call:
glm(formula = switch ~ arsenic + dist + dist:educ, family = "binomial", data = wells)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.00974    0.07978   -0.12      0.9    
arsenic      0.48041    0.04196   11.45  < 2e-16 ***
dist        -1.37566    0.13647  -10.08  < 2e-16 ***
dist:educ    0.09292    0.01612    5.77  8.2e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3896.2  on 3016  degrees of freedom
AIC: 3904

Number of Fisher Scoring iterations: 4

measures <- bind_rows(
  data.frame(deviance = model1$deviance, aic = model1$aic, df = model1$df.residual),
  data.frame(deviance = model2$deviance, aic = model2$aic, df = model2$df.residual),
  data.frame(deviance = model3$deviance, aic = model3$aic, df = model3$df.residual),
  data.frame(deviance = model4$deviance, aic = model4$aic, df = model4$df.residual)
)
measures

  deviance  aic   df
1     3908 3918 3015
2     3883 3905 3009
3     3896 3906 3015
4     3896 3904 3016

anova(model4, model2, test = "LRT")

Analysis of Deviance Table

Model 1: switch ~ arsenic + dist + dist:educ
Model 2: switch ~ (arsenic + dist + assoc + educ)^2
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1      3016       3896                       
2      3009       3883  7     12.8    0.078 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
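Since the likelihood-ratio test gives p = 0.078 > 0.05, the seven extra terms of the second-order model do not significantly improve the fit; model4, which also attains the lowest AIC (3904), is therefore retained.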
1.3 Prediction
Split the dataset, in a repeatable manner, into two subsets: a training set and a test set, the latter containing 10% of the observations.
Use the last reduced model to predict the values of switch for the observations in the test set. Using a cutoff of 0.5, classify each household according to its estimated probability of switching and compute the proportions of correct classifications.
set.seed(1234)  # makes the random split repeatable
n <- nrow(wells)
indices <- sample(n, size = 0.1 * n)  # 10% of the rows go to the test set
train <- wells[-indices,]
test <- wells[indices,]
# refit the reduced model on the training set only, then predict
# P(switch = 1) for the held-out households
model <- glm(switch ~ arsenic + dist + dist:educ, family = "binomial", data = train)
pred <- predict(model, test, type = "response")
# classify with the 0.5 cutoff; fixing both logical levels explicitly
# keeps the 0/1 labels correct even if only one class is predicted
predictions <- data.frame(obs = test$switch, pred = (pred > 0.5))
predictions$pred <- factor(predictions$pred, levels = c(FALSE, TRUE),
                           labels = levels(predictions$obs))
tab <- with(predictions, table(obs, pred))
prop.table(tab, margin = 1)

   pred
obs      0      1
  0 0.2932 0.7068
  1 0.1716 0.8284
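The diagonal entries give the proportion of correct classifications within each observed class: roughly 29% of non-switchers and 83% of switchers are classified correctly. The overall proportion of correct classifications can be computed directly (value not shown here):

# overall proportion of correct classifications on the test set
mean(predictions$obs == predictions$pred)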