Linear Models
Lab#5 – ANOVA models
1 Do tattoos and how you dress influence whether someone will help you?
An experiment was designed to investigate whether style of dress and the presence of visible tattoos, among other factors, can influence how long a person will interact with a stranger asking for directions. The data is available in the file tattoos.csv.
1.1 Read and explore the data
library(dplyr)
url <- "https://web.tecnico.ulisboa.pt/paulo.soares/aml/data/tattoos.csv"
att <- read.csv(url)
str(att)
'data.frame': 80 obs. of 6 variables:
$ dress : chr "casual" "casual" "casual" "casual" ...
$ tattoo : chr "vis" "not" "vis" "not" ...
$ time : int 10 51 31 75 132 112 13 122 7 140 ...
$ gender : chr "Female" "Male" "Female" "Male" ...
$ ethnicity: chr "White" "White" "White" "African American" ...
$ age : chr "appears under 40" "appears over 40" "appears under 40" "appears under 40" ...
att <- att |> mutate(across(where(is.character), as.factor))
summary(att)
    dress    tattoo       time          gender              ethnicity
 casual:40   not:40   Min.   :  2.0   Female:35   African American:14
 prof  :40   vis:40   1st Qu.:  9.5   Male  :45   Other           : 8
                      Median : 35.0               White           :58
                      Mean   : 46.1
                      3rd Qu.: 65.2
                      Max.   :166.0
               age
 appears over 40 :39
 appears under 40:41
with(att, {
  table(dress, tattoo)
})
        tattoo
dress    not vis
  casual  20  20
  prof    20  20
1.2 Single factor analysis
Perform separate ANOVA analyses for the factors dress and ethnicity. Can we have the same confidence in the results from both models?
fit <- aov(time ~ dress, data = att)
fit
Call:
   aov(formula = time ~ dress, data = att)
Terms:
                dress Residuals
Sum of Squares   2880    143494
Deg. of Freedom     1        78
Residual standard error: 42.89
Estimated effects may be unbalanced
summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)
dress        1   2880    2880    1.57   0.21
Residuals   78 143494    1840
fit <- aov(time ~ ethnicity, data = att)
fit
Call:
   aov(formula = time ~ ethnicity, data = att)
Terms:
                ethnicity Residuals
Sum of Squares        519    145855
Deg. of Freedom         2        77
Residual standard error: 43.52
Estimated effects may be unbalanced
summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)
ethnicity    2    519     259    0.14   0.87
Residuals   77 145855    1894
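One way to approach the question of confidence in the two models is to compare group sizes and group spreads, since the one-way F test is generally less robust to unequal variances when the groups are of very different sizes. The following is only a sketch on synthetic data: the group sizes are copied from the summary(att) output (40/40 for dress, 14/8/58 for ethnicity), but the response values are simulated, and the name sim is hypothetical.

```r
# Synthetic stand-in for att: only the group sizes come from the lab output;
# the response values are simulated and purely illustrative.
set.seed(1)
sim <- data.frame(
  dress = factor(rep(c("casual", "prof"), each = 40)),
  ethnicity = factor(sample(rep(c("African American", "Other", "White"),
                                times = c(14, 8, 58)))),
  time = round(rexp(80, rate = 1 / 46))
)

# Group sizes: equal for dress, very unequal for ethnicity
n_dress <- table(sim$dress)
n_ethnicity <- table(sim$ethnicity)

# Spread of the response within each ethnicity group
sd_ethnicity <- tapply(sim$time, sim$ethnicity, sd)

# Bartlett's test of equal variances (itself sensitive to non-normality)
bt <- bartlett.test(time ~ ethnicity, data = sim)

print(n_dress)
print(n_ethnicity)
print(sd_ethnicity)
print(bt$p.value)
```

On the real data, the same calls with att in place of sim would show whether the small "Other" group (8 observations) has a spread comparable to the other groups.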
1.3 Two-way analysis
Consider now an ANOVA model including the factors dress and tattoo. Start by showing that we have a balanced design.
att |> summarize(n = n(), cell_mean = mean(time), .by = c(dress, tattoo))
   dress tattoo  n cell_mean
1 casual    vis 20     67.60
2 casual    not 20     36.55
3   prof    not 20     47.50
4   prof    vis 20     32.65
fit <- lm(time ~ dress * tattoo, data = att)
summary(fit)
Call:
lm(formula = time ~ dress * tattoo, data = att)
Residuals:
   Min     1Q Median     3Q    Max
-59.60 -29.65  -6.53  18.96 122.35
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            36.55       9.31    3.93  0.00019 ***
dressprof              10.95      13.16    0.83  0.40802
tattoovis              31.05      13.16    2.36  0.02088 *
dressprof:tattoovis   -45.90      18.61   -2.47  0.01592 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.6 on 76 degrees of freedom
Multiple R-squared: 0.101, Adjusted R-squared: 0.0651
F-statistic: 2.83 on 3 and 76 DF, p-value: 0.0438
anova(fit)
Analysis of Variance Table
Response: time
             Df Sum Sq Mean Sq F value Pr(>F)
dress         1   2880    2880    1.66  0.201
tattoo        1   1312    1312    0.76  0.387
dress:tattoo  1  10534   10534    6.08  0.016 *
Residuals    76 131647    1732
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Order of factors
fit <- aov(time ~ tattoo * dress, data = att)
summary(fit)
             Df Sum Sq Mean Sq F value Pr(>F)
tattoo        1   1312    1312    0.76  0.387
dress         1   2880    2880    1.66  0.201
tattoo:dress  1  10534   10534    6.08  0.016 *
Residuals    76 131647    1732
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
ci <- TukeyHSD(fit)
plot(ci)
ci
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = time ~ tattoo * dress, data = att)
$tattoo
        diff    lwr   upr  p adj
vis-not  8.1 -10.44 26.64 0.3868
$dress
            diff    lwr   upr  p adj
prof-casual  -12 -30.54 6.535 0.2012
$`tattoo:dress`
                        diff     lwr     upr  p adj
vis:casual-not:casual  31.05  -3.522 65.6221 0.0939
not:prof-not:casual    10.95 -23.622 45.5221 0.8391
vis:prof-not:casual    -3.90 -38.472 30.6721 0.9909
not:prof-vis:casual   -20.10 -54.672 14.4721 0.4265
vis:prof-vis:casual   -34.95 -69.522 -0.3779 0.0466
vis:prof-not:prof     -14.85 -49.422 19.7221 0.6733
Recommendation
There is some mild evidence that, if you have visible tattoos, you had better dress casually to get the most attention from strangers when asking for directions.
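To help read the lm() coefficients for time ~ dress * tattoo, the short sketch below rebuilds them from the four cell means reported by summarize(). It assumes R's default treatment coding with reference levels dress = casual and tattoo = not, which is what the coefficient names dressprof and tattoovis suggest. The variable names are hypothetical.

```r
# Cell means copied from the summarize() output of the lab
cell <- c(casual_not = 36.55, casual_vis = 67.60,
          prof_not = 47.50, prof_vis = 32.65)

# Treatment coding: the intercept is the mean of the reference cell
intercept <- cell[["casual_not"]]

# Main-effect coefficients are differences from the reference cell
dressprof <- cell[["prof_not"]] - cell[["casual_not"]]
tattoovis <- cell[["casual_vis"]] - cell[["casual_not"]]

# The interaction is a difference of differences
dress_by_tattoo <- (cell[["prof_vis"]] - cell[["prof_not"]]) -
  (cell[["casual_vis"]] - cell[["casual_not"]])

coefs <- round(c(intercept, dressprof, tattoovis, dress_by_tattoo), 2)
print(coefs)
```

The four values match the Estimate column of summary(fit): 36.55, 10.95, 31.05 and -45.90. The negative interaction says that the tattoo effect seen with casual dress (+31.05) reverses with professional dress (31.05 - 45.90 = -14.85), which is the vis:prof-not:prof difference in the Tukey table.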
Fit a second ANOVA model with the factors dress and ethnicity. Check that now we have an unbalanced design and explore some consequences of that.
att |> summarize(n = n(), cell_mean = mean(time), .by = c(dress, ethnicity))
   dress        ethnicity  n cell_mean
1 casual            White 27     60.07
2 casual African American  9     45.89
3 casual            Other  4     12.00
4   prof            Other  4     66.50
5   prof            White 31     34.19
6   prof African American  5     55.40
fit <- lm(time ~ dress * ethnicity, data = att)
summary(fit)
Call:
lm(formula = time ~ dress * ethnicity, data = att)
Residuals:
  Min    1Q Median    3Q   Max
-63.5 -30.2   -9.3  19.1 120.8
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                 45.89      13.97    3.29   0.0016 **
dressprof                    9.51      23.37    0.41   0.6853
ethnicityOther             -33.89      25.18   -1.35   0.1825
ethnicityWhite              14.19      16.13    0.88   0.3820
dressprof:ethnicityOther    44.99      37.74    1.19   0.2371
dressprof:ethnicityWhite   -35.39      25.85   -1.37   0.1751
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.9 on 74 degrees of freedom
Multiple R-squared: 0.112, Adjusted R-squared: 0.0522
F-statistic: 1.87 on 5 and 74 DF, p-value: 0.11
anova(fit)
Analysis of Variance Table
Response: time
                Df Sum Sq Mean Sq F value Pr(>F)
dress            1   2880    2880    1.64  0.204
ethnicity        2    424     212    0.12  0.887
dress:ethnicity  2  13112    6556    3.73  0.029 *
Residuals       74 129958    1756
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Order of factors
fit <- aov(time ~ ethnicity * dress, data = att)
summary(fit)
                Df Sum Sq Mean Sq F value Pr(>F)
ethnicity        2    519     259    0.15  0.863
dress            1   2785    2785    1.59  0.212
ethnicity:dress  2  13112    6556    3.73  0.029 *
Residuals       74 129958    1756
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Warning
With unbalanced data, results become dependent on the order of the factors and, if using only basic R anova functions, they can be unreliable.
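The order dependence comes from the sequential sums of squares used by anova() and summary() for aov fits: each term is adjusted only for the terms listed before it. In the tables above, dress gets 2880 when entered first and 2785 when entered after ethnicity, while the two main effects always add up to the same total (2880 + 424 = 519 + 2785 = 3304). The sketch below reproduces this behaviour on synthetic data. The cell counts are copied from the dress by ethnicity table, the response values are simulated, and the additive model (no interaction) is an assumption made to keep the comparison simple. It also shows that base R's drop1() tests each term after adjusting for the other, so its result does not depend on the order.

```r
set.seed(42)

# Synthetic unbalanced design: cell counts mimic the lab's dress x ethnicity table,
# response values are simulated and purely illustrative
cells <- expand.grid(dress = c("casual", "prof"),
                     ethnicity = c("African American", "Other", "White"))
counts <- c(9, 5, 4, 4, 27, 31)  # same row order as cells
sim <- cells[rep(seq_len(nrow(cells)), times = counts), ]
sim$time <- rnorm(nrow(sim), mean = 46, sd = 40)

# Sequential sums of squares depend on the order of the terms
seq_de <- anova(lm(time ~ dress + ethnicity, data = sim))
seq_ed <- anova(lm(time ~ ethnicity + dress, data = sim))
ss_dress_first <- seq_de["dress", "Sum Sq"]
ss_dress_last <- seq_ed["dress", "Sum Sq"]

# drop1() adjusts each term for all the others, so the order is irrelevant
adj_de <- drop1(lm(time ~ dress + ethnicity, data = sim), test = "F")
adj_ed <- drop1(lm(time ~ ethnicity + dress, data = sim), test = "F")

print(c(dress_first = ss_dress_first, dress_last = ss_dress_last))
print(adj_de)
```

For models with an interaction, drop1() only considers the highest-order term. Add-on packages such as car offer Anova() with a type argument for Type II and Type III tests, should installing a package be acceptable.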