Linear Models

Lab #7 – Logistic regression

Author

Paulo Soares

Published

December 19, 2025

1 Wells in Bangladesh

Many of the drinking water wells used in Bangladesh and other South Asian countries are contaminated with naturally occurring arsenic (\(\ce{As2O3}\)), a problem estimated to affect around 100 million people. Arsenic is a cumulative poison: the risk of cancer and other diseases increases with cumulative exposure.

Any settlement can have wells with varying levels of arsenic. Unfortunately, even if a neighboring well is safe, that doesn’t mean your well is too. Conversely, if your well has a high level of arsenic, it’s possible you could find a safe well nearby – provided you’re willing to travel the necessary distance and your neighbor agrees to share their well.

A research team measured arsenic levels in all wells in a small region of Bangladesh and classified each as either “safe” (less than 50 μg/L) or “unsafe” (at least 50 μg/L). Households that drew water from unsafe wells were encouraged to switch to safe, neighboring, or community wells, or to construct new wells. Several years later, the researchers returned to the region to find out who had switched wells.

The data file is available at:

https://web.tecnico.ulisboa.pt/paulo.soares/aml/dados/wells.csv (82.3 KB).

The five columns in this file contain the values for the following variables:

switch: indicates whether or not the household switched wells (response variable)

arsenic: arsenic level of the household well (in units of 100 μg/L)

dist: distance to the nearest safe well (m)

assoc: indicates whether any member of the household actively participates in community associations

educ: educational level of the household representative (years of schooling)

This study aims to fit candidate logistic regression models relating the binary response to the other information collected about each household.
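For reference, the logistic regression model links the probability of switching to a linear predictor in the covariates through the logit transformation:

\[
\log \frac{\Pr(\text{switch} = 1 \mid x)}{1 - \Pr(\text{switch} = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad \text{i.e.,} \qquad
\Pr(\text{switch} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}.
\]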

1.1 Exploratory analysis

Read and inspect the data file and convert the variable dist to hectometers. Analyze the associations between the covariates and the response variable.

library(tidyverse)
theme_set(theme_light())

url <- "https://web.tecnico.ulisboa.pt/paulo.soares/iadme/dados/wells.csv"
wells <- read.csv(url)
str(wells)
'data.frame':   3020 obs. of  5 variables:
 $ switch : int  1 1 0 1 1 1 1 1 1 1 ...
 $ arsenic: num  2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
 $ dist   : num  16.8 47.3 21 21.5 40.9 ...
 $ assoc  : int  0 0 0 0 1 1 1 0 1 1 ...
 $ educ   : int  0 0 10 12 14 9 4 10 0 0 ...
# encode the binary variables as factors and convert dist from meters to hectometers
wells$assoc <- as.factor(wells$assoc)
wells$switch <- as.factor(wells$switch)
wells$dist <- wells$dist / 100

summary(wells)
 switch      arsenic          dist         assoc         educ      
 0:1283   Min.   :0.51   Min.   :0.00387   0:1743   Min.   : 0.00  
 1:1737   1st Qu.:0.82   1st Qu.:0.21117   1:1277   1st Qu.: 0.00  
          Median :1.30   Median :0.36762            Median : 5.00  
          Mean   :1.66   Mean   :0.48332            Mean   : 4.83  
          3rd Qu.:2.20   3rd Qu.:0.64041            3rd Qu.: 8.00  
          Max.   :9.65   Max.   :3.39531            Max.   :17.00  
# distributions of arsenic level and of distance to the nearest safe well,
# by switching status
ggplot(wells) +
  geom_boxplot(aes(arsenic, y = switch, color = switch)) +
  theme(legend.position = "none")

ggplot(wells) +
  geom_boxplot(aes(dist, y = switch, color = switch)) +
  theme(legend.position = "none")

# conditional distribution of switch within each level of assoc
tab <- with(wells, table(switch, assoc))
prop.table(tab, margin = 2)
      assoc
switch      0      1
     0 0.4096 0.4456
     1 0.5904 0.5544
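The switching proportions in the two assoc groups are close (59% vs. 55%). As an extra check, not part of the original lab, a chi-squared test of independence can formally assess this association:

# chi-squared test of independence between switch and assoc
chisq.test(tab)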
# counts of households by education level and switching status
wells |>
  summarise(n = n(), .by = c(switch, educ)) |> 
  ggplot() +
  geom_col(aes(educ, n, fill = switch))

# switching proportions within each education level
# (combinations with a single household are dropped first)
wells |>
  summarise(n = n(), .by = c(switch, educ)) |> 
  filter(n > 1) |> 
  mutate(prop = n / sum(n), .by = educ) |> 
  ggplot() +
    geom_col(aes(educ, prop, fill = switch))

1.2 Logistic regression

  1. Start by fitting the first-order logistic regression model with all covariates, and then consider a second-order model with all pairwise interactions.

    model1 <- glm(switch ~ ., family = "binomial", data = wells)
    summary(model1)
    
    Call:
    glm(formula = switch ~ ., family = "binomial", data = wells)
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -0.15671    0.09960   -1.57     0.12    
    arsenic      0.46702    0.04160   11.23  < 2e-16 ***
    dist        -0.89611    0.10458   -8.57  < 2e-16 ***
    assoc1      -0.12430    0.07697   -1.61     0.11    
    educ         0.04245    0.00959    4.43  9.5e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 4118.1  on 3019  degrees of freedom
    Residual deviance: 3907.8  on 3015  degrees of freedom
    AIC: 3918
    
    Number of Fisher Scoring iterations: 4
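    Since the coefficients are on the log-odds scale, it helps to look at the implied odds ratios. As an addition to the lab, they can be obtained together with profile-likelihood confidence intervals:

    # odds ratios and 95% profile-likelihood confidence intervals for model1
    exp(cbind(OR = coef(model1), confint(model1)))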
    # default residual diagnostics (of limited use for a binary response)
    par(mfrow = c(2, 2))
    plot(model1)

    par(mfrow = c(1, 1))
    Caution: for this particular family of GLMs, residuals are... weird, and less useful than LMs' residuals, since a binary response yields residuals that take only two possible values at each fitted probability.
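    A diagnostic that does work for binary responses is a binned residual plot: observations are grouped into bins of similar fitted probability and the average raw residual per bin is plotted. A minimal base-R sketch, assuming 40 bins (the binnedplot() function in the arm package offers a polished version):

    # group observations into bins of similar fitted probability and plot the
    # average raw residual per bin; unique() guards against duplicate breaks
    fitted_p <- fitted(model1)
    resid_raw <- residuals(model1, type = "response")
    breaks <- unique(quantile(fitted_p, probs = seq(0, 1, length.out = 41)))
    bins <- cut(fitted_p, breaks = breaks, include.lowest = TRUE)
    plot(tapply(fitted_p, bins, mean), tapply(resid_raw, bins, mean),
         xlab = "mean fitted probability", ylab = "mean residual")
    abline(h = 0, lty = 2)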

    model2 <- glm(switch ~ .^2, family = "binomial", data = wells)
    summary(model2)
    
    Call:
    glm(formula = switch ~ .^2, family = "binomial", data = wells)
    
    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)     -0.0577     0.1765   -0.33   0.7435    
    arsenic          0.5337     0.0945    5.65  1.6e-08 ***
    dist            -1.2161     0.2793   -4.35  1.3e-05 ***
    assoc1           0.1389     0.1892    0.73   0.4629    
    educ            -0.0132     0.0214   -0.62   0.5365    
    arsenic:dist    -0.1100     0.1032   -1.07   0.2862    
    arsenic:assoc1  -0.1626     0.0842   -1.93   0.0536 .  
    arsenic:educ     0.0174     0.0110    1.58   0.1132    
    dist:assoc1      0.2188     0.2148    1.02   0.3083    
    dist:educ        0.0838     0.0268    3.13   0.0018 ** 
    assoc1:educ     -0.0276     0.0198   -1.39   0.1641    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 4118.1  on 3019  degrees of freedom
    Residual deviance: 3883.4  on 3009  degrees of freedom
    AIC: 3905
    
    Number of Fisher Scoring iterations: 4
  2. Based on the two fitted models, look for a reduced model that appears adequate. Analyze the variation in goodness-of-fit measures of the successive models and compare the models with the highest and lowest degrees of freedom using an appropriate hypothesis test.

    # dist:educ is the only clearly significant interaction in model2, and assoc
    # is not significant, so drop assoc and add only the dist:educ interaction
    model3 <- glm(switch ~ . - assoc + dist:educ, family = "binomial", data = wells)
    summary(model3)
    
    Call:
    glm(formula = switch ~ . - assoc + dist:educ, family = "binomial", 
        data = wells)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  0.000496   0.109615    0.00   0.9964    
    arsenic      0.480599   0.041987   11.45  < 2e-16 ***
    dist        -1.389852   0.171884   -8.09  6.2e-16 ***
    educ        -0.002077   0.015255   -0.14   0.8917    
    dist:educ    0.095636   0.025680    3.72   0.0002 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 4118.1  on 3019  degrees of freedom
    Residual deviance: 3896.2  on 3015  degrees of freedom
    AIC: 3906
    
    Number of Fisher Scoring iterations: 4
    # the main effect of educ is not significant in model3, so keep educ only
    # through its interaction with dist
    model4 <- glm(switch ~ arsenic + dist + dist:educ, family = "binomial", data = wells)
    summ4 <- summary(model4)
    summ4
    
    Call:
    glm(formula = switch ~ arsenic + dist + dist:educ, family = "binomial", 
        data = wells)
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -0.00974    0.07978   -0.12      0.9    
    arsenic      0.48041    0.04196   11.45  < 2e-16 ***
    dist        -1.37566    0.13647  -10.08  < 2e-16 ***
    dist:educ    0.09292    0.01612    5.77  8.2e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 4118.1  on 3019  degrees of freedom
    Residual deviance: 3896.2  on 3016  degrees of freedom
    AIC: 3904
    
    Number of Fisher Scoring iterations: 4
    # collect the goodness-of-fit measures of the four models in a single table
    models <- list(model1, model2, model3, model4)
    measures <- data.frame(deviance = sapply(models, deviance),
                           aic = sapply(models, AIC),
                           df = sapply(models, df.residual))
    
    measures
      deviance  aic   df
    1     3908 3918 3015
    2     3883 3905 3009
    3     3896 3906 3015
    4     3896 3904 3016
    # likelihood-ratio test: reduced model (model4) vs full second-order model
    anova(model4, model2, test = "LRT")
    Analysis of Deviance Table
    
    Model 1: switch ~ arsenic + dist + dist:educ
    Model 2: switch ~ (arsenic + dist + assoc + educ)^2
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
    1      3016       3896                       
    2      3009       3883  7     12.8    0.078 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
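    With a p-value of about 0.078, the likelihood-ratio test does not reject the reduced model at the usual 5% level, so model4 is retained. As a sanity check, the same p-value can be computed directly from the deviance difference:

    # the deviance difference follows (approximately) a chi-squared distribution
    # with df equal to the difference in residual degrees of freedom
    pchisq(model4$deviance - model2$deviance,
           df = model4$df.residual - model2$df.residual,
           lower.tail = FALSE)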

1.3 Prediction

Split the dataset into two subsets, training and test, in a reproducible manner. The latter should contain 10% of the observations.

Use the last reduced model to predict the values of switch for the observations in the test set. Use the cutoff value of 0.5 to classify each household according to the estimated probability of change and calculate the proportions of correct classifications.

set.seed(1234)

# reproducible 90/10 train-test split
n <- nrow(wells)
indices <- sample(n, size = 0.1 * n)
train <- wells[-indices,]
test <- wells[indices,]

# refit the reduced model on the training set and predict probabilities
# for the test set
model <- glm(switch ~ arsenic + dist + dist:educ, family = "binomial", data = train)
pred <- predict(model, test, type = "response")

# classify each household with the 0.5 cutoff
predictions <- data.frame(obs = test$switch, pred = (pred > 0.5))

# relabel the logical predictions with the factor levels of obs
# (FALSE/TRUE map to 0/1 because factor levels are sorted)
predictions$pred <- as.factor(predictions$pred)
levels(predictions$pred) <- levels(predictions$obs)

# confusion matrix with proportions within each observed class
tab <- with(predictions, table(obs, pred))
prop.table(tab, margin = 1)
   pred
obs      0      1
  0 0.2932 0.7068
  1 0.1716 0.8284
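Only about 29% of the non-switching households are classified correctly, against about 83% of the switchers. As an additional summary, not shown in the original output, the overall proportion of correct classifications follows directly from the confusion matrix:

# overall proportion of correct classifications (accuracy)
sum(diag(tab)) / sum(tab)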