4 Analysis of Variance

\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \def\bul{\small \bullet} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)

Linear models with only qualitative covariates are called ANOVA models.

In this context, the main inferential questions are:

Q1. Does a factor influence the response variable?

Q2. If so, how do the different levels of the factor influence the response variable?

Q3. Is there an interaction between factors?

4.1 Single-factor ANOVA

Data:

\(Y_{ij}\) – value of the response variable of the \(j^{th}\) observation for the \(i^{th}\) factor level, with \(i=1,\ldots, r\) and \(j=1,\ldots, n_i\)

Note

If the \(n_i\) are all equal, the design is said to be balanced.

Four varieties of wheat were planted in several plots, all with the same area, according to a completely randomized experimental design. The wheat production per plot (in kg) was recorded.

Wheat variety   Wheat production (kg)
1               34.3 35.7 37.8 36.9 33.2
2               40.4 38.0 38.3 40.6 38.8 43.5
3               30.5 32.0 33.4 36.2
4               33.1 38.4 35.7 40.9
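The R outputs in this section use a data frame wheat, with columns production and variety, and a summary table wheat_means (both names taken from the code further below). The code that built them is not shown; a minimal sketch to reconstruct them from the table above:

# Reconstruct the wheat data from the table above
wheat <- data.frame(
  variety    = factor(rep(1:4, times = c(5, 6, 4, 4))),
  production = c(34.3, 35.7, 37.8, 36.9, 33.2,
                 40.4, 38.0, 38.3, 40.6, 38.8, 43.5,
                 30.5, 32.0, 33.4, 36.2,
                 33.1, 38.4, 35.7, 40.9)
)

# Per-variety sample sizes and means, with an overall row (variety = NA) appended
wheat_means <- data.frame(
  variety = c(levels(wheat$variety), NA),
  n       = c(tabulate(wheat$variety), nrow(wheat)),
  mean    = c(tapply(wheat$production, wheat$variety, mean), mean(wheat$production))
)
wheat_means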
  variety  n   mean
1       1  5 35.580
2       2  6 39.933
3       3  4 33.025
4       4  4 37.025
5    <NA> 19 36.721

The ANOVA model

\[Y_{ij} = \mu + \alpha_i + E_{ij}\]

for \(i=1,\ldots, r\) and \(j=1,\ldots, n_i\) with \(E_{ij} \sim N(0, \sigma^2)\), uncorrelated.

Note

This is not yet in the form of a general linear model, but we will get there.

Problem

This is a non-identifiable model!

\[E[Y_{ij}] = \mu + \alpha_i = (\mu + c) + (\alpha_i - c),\ \forall c\in \mathbb{R}\]

\(\leadsto\) we get a linear model of incomplete rank

\(\leadsto\) there are non-estimable parametric functions

A solution is to impose identifiability restrictions.

Possible choices

The response surface:

\[E[Y_{ij}] = \mu + \alpha_i\]

  1. Drop \(\mu\)

    The lack of an intercept would invalidate valuable results (\(R^2\), F-tests, …)

  2. Set \(\alpha_1=0\)

    Then, \(\mu=E[Y_{1j}]\) and \(\alpha_i=E[Y_{ij}] - E[Y_{1j}]\), for \(i>1\).

Level \(x_2\) \(x_3\) \(x_4\)
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1

\[\begin{align*} E[Y_{ij}] & = \mu + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 =\\[0.25cm] & = \begin{cases} \mu, & i=1\\ \mu+\alpha_2, & i=2\\ \mu + \alpha_3, & i=3\\ \mu + \alpha_4, & i=4 \end{cases} \end{align*}\]

Adopting this restriction, the model can be simply written as

\[Y_{ij} = \mu_i + E_{ij}\]

with \(E[Y_{ij}] = \mu_i = \mu + \alpha_i\) (\(\alpha_1=0\))

This is called the cell means model.

  3. Set \(\sum_{i=1}^r n_i\alpha_i =0\).

    Then \(\mu=\frac{\sum_{i=1}^r n_i\mu_i}{n}\), an overall mean response, and \(\alpha_i=\mu_i - \mu\).

    This is called the factor effects model, which requires a different encoding.

Level \(x_1\) \(x_2\) \(x_3\)
1 1 0 0
2 0 1 0
3 0 0 1
4 \(-\frac{n_1}{n_4}\) \(-\frac{n_2}{n_4}\) \(-\frac{n_3}{n_4}\)

\[\begin{align*} E[Y_{ij}] & = \mu + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 =\\[0.25cm] & = \begin{cases} \mu+\alpha_1, & i=1\\ \mu+\alpha_2, & i=2\\ \mu + \alpha_3, & i=3\\ \mu - \frac{1}{n_4}(n_1\alpha_1+n_2\alpha_2+n_3\alpha_3), & i=4 \end{cases} \end{align*}\]
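In R, the encoding with \(\alpha_1=0\) is the default ("treatment") contrast for factors, so the design matrix reproduces the first table above; a quick check on the wheat data (sketch). The weighted encoding of the factor effects model has to be set up by hand with contrasts(), as done further below.

# One distinct design-matrix row per variety: R's default treatment encoding
unique(model.matrix(~ variety, data = wheat))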

# Cell means encoding: one mean per variety (no intercept); the F statistic
# here tests mu_i = 0 for all i, not the equality of the means
fit1 <- aov(production ~ variety - 1, wheat)
summary(fit1)
          Df Sum Sq Mean Sq F value Pr(>F)    
variety    4  25744    6436    1113 <2e-16 ***
Residuals 15     87       6                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The same fit via lm(); anova() reproduces the table above
fit1.1 <- lm(production ~ variety - 1, wheat)
anova(fit1.1)
Analysis of Variance Table

Response: production
          Df  Sum Sq Mean Sq F value    Pr(>F)    
variety    4 25743.7  6435.9  1112.5 < 2.2e-16 ***
Residuals 15    86.8     5.8                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit1.1)

Call:
lm(formula = production ~ variety - 1, data = wheat)

Residuals:
   Min     1Q Median     3Q    Max 
-3.925 -1.479  0.120  1.347  3.875 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
variety1  35.5800     1.0756   33.08 1.96e-15 ***
variety2  39.9333     0.9819   40.67  < 2e-16 ***
variety3  33.0250     1.2026   27.46 3.06e-14 ***
variety4  37.0250     1.2026   30.79 5.67e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9957 
F-statistic:  1113 on 4 and 15 DF,  p-value: < 2.2e-16
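The next output corresponds to the default encoding (intercept plus treatment contrasts, \(\alpha_1=0\)). The call that produced it is not shown in these notes; a sketch consistent with the printed Call, and with the later use of fit2 in TukeyHSD(), is:

# Default encoding: the intercept is the mean of variety 1 and the other
# coefficients are differences to that reference level
fit2 <- aov(production ~ variety, wheat)   # aov object, reused by TukeyHSD() below
summary(lm(production ~ variety, wheat))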

Call:
lm(formula = production ~ variety, data = wheat)

Residuals:
   Min     1Q Median     3Q    Max 
-3.925 -1.479  0.120  1.347  3.875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   35.580      1.076  33.078 1.96e-15 ***
variety2       4.353      1.456   2.989  0.00918 ** 
variety3      -2.555      1.613  -1.584  0.13415    
variety4       1.445      1.613   0.896  0.38462    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared:  0.5872,    Adjusted R-squared:  0.5046 
F-statistic: 7.112 on 3 and 15 DF,  p-value: 0.003391
# Default (treatment) contrasts: the encoding of the cell means model
contrasts(wheat$variety)
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
# Weighted sum-to-zero contrasts for the factor effects model: start from
# contr.sum() and reweight the last row by the group sizes n_i
contr <- contr.sum(4)
contr[4,] <- -wheat_means$n[1:3] / wheat_means$n[4]
contrasts(wheat$variety) <- contr
contrasts(wheat$variety)
   [,1] [,2] [,3]
1  1.00  0.0    0
2  0.00  1.0    0
3  0.00  0.0    1
4 -1.25 -1.5   -1
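Refitting with this weighted sum-to-zero encoding yields the factor effects estimates below (the fitting call is again not shown; presumably, given the use of fit3 further on):

# Factor effects encoding: intercept = weighted overall mean,
# coefficients = deviations alpha_i
fit3 <- lm(production ~ variety, wheat)
summary(fit3)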

Call:
lm(formula = production ~ variety, data = wheat)

Residuals:
   Min     1Q Median     3Q    Max 
-3.925 -1.479  0.120  1.347  3.875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.7211     0.5518  66.548  < 2e-16 ***
variety1     -1.1411     0.9233  -1.236  0.23554    
variety2      3.2123     0.8122   3.955  0.00127 ** 
variety3     -3.6961     1.0685  -3.459  0.00351 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared:  0.5872,    Adjusted R-squared:  0.5046 
F-statistic: 7.112 on 3 and 15 DF,  p-value: 0.003391

To check the model assumptions we can use Bartlett’s test for homogeneity of variances across factor levels and the Shapiro–Wilk test for normality of the (studentized) residuals:

bartlett.test(production ~ variety, data=wheat)

    Bartlett test of homogeneity of variances

data:  production by variety
Bartlett's K-squared = 1.3645, df = 3, p-value = 0.7139
residuals <- rstudent(fit3)   # studentized residuals of the factor effects fit
shapiro.test(residuals)

    Shapiro-Wilk normality test

data:  residuals
W = 0.97234, p-value = 0.822

Inferences in the ANOVA model

The OLS estimators are extremely simple.

\[\begin{align*} Q(\bs{\mu}) & = \sum_{i=1}^r\sum_{j=1}^{n_i}(Y_{ij}- \mu_i)^2 =\\[0.25cm] & = Q_1 + \ldots + Q_r =\\[0.25cm] & = \sum_{j=1}^{n_1}(Y_{1j}- \mu_1)^2 + \ldots + \sum_{j=1}^{n_r}(Y_{rj}- \mu_r)^2 \end{align*}\]

\(\frac{dQ_i}{d\mu_i}=0 \iff \mu_i = \dfrac{\sum_{j=1}^{n_i}Y_{ij}}{n_i}=\overline{Y}_{i\bul}\)

The estimators are:

\(\hat{\mu}_i =\overline{Y}_{i\bul}\)

\(\hat{\mu} =\dfrac{\sum_{i=1}^{r}n_i\overline{Y}_{i\bul}}{n}=\overline{Y}_{\bul\bul}\)

\(\hat{\alpha}_i = \overline{Y}_{i\bul} - \overline{Y}_{\bul\bul}\)

For \(\sigma^2\), instead of the ML estimator \(SSE/n\) we use the unbiased

\(\hat{\sigma}^2 = MSE = \dfrac{SSE}{n-r}=\dfrac{\sum_{i=1}^r\sum_{j=1}^{n_i}\left(Y_{ij}- \overline{Y}_{i\bul}\right)^2}{n-r}\)
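These expressions are easy to verify numerically; a minimal sketch on the wheat data (assuming the wheat data frame defined earlier):

# Group means: hat(mu)_i = Ybar_i.
means <- tapply(wheat$production, wheat$variety, mean)

# SSE and MSE: hat(sigma)^2 = SSE / (n - r)
sse <- sum((wheat$production - means[wheat$variety])^2)
n   <- nrow(wheat)
r   <- nlevels(wheat$variety)
mse <- sse / (n - r)
sqrt(mse)   # matches the residual standard error reported above (2.405)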

Q1. Does a factor influence the response variable?

We need to test

\(H_0: \mu_1=\ldots=\mu_r=\mu\) vs. \(H_1: \exists i,j : \mu_i\neq\mu_j\)

or

\(H_0: \alpha_1=\ldots=\alpha_r=0\) vs. \(H_1: \exists i : \alpha_i\neq 0\)

As we have seen in the example, testing these hypotheses is equivalent to testing the significance of the corresponding regression model, regardless of the chosen encoding.

Q2. How do different levels of the factor influence the response variable?

To handle this question we need to analyse the level means \(\mu_i\) or linear combinations of them, \(\sum_{i=1}^r c_i\mu_i\).

We have,

\[\overline{Y}_{i\bul} \sim N\left(\mu_i, \frac{\sigma^2}{n_i}\right)\text{ independent of }\frac{SSE}{\sigma^2}\sim \chi^2_{(n-r)}\]

And so,

\[\frac{\overline{Y}_{i\bul} - \mu_i}{\sqrt{\frac{MSE}{n_i}}}\sim t_{(n-r)}\]

and, more generally,

\[\frac{\sum_{i=1}^r c_i\overline{Y}_{i\bul} - \sum_{i=1}^r c_i\mu_i}{\sqrt{MSE\sum_{i=1}^r \frac{c_i^2}{n_i}}}\sim t_{(n-r)}\]
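For instance, a 95% confidence interval for \(\mu_1-\mu_2\) can be built directly from this pivot (a sketch reusing means, mse, n and r from the chunk above):

ni  <- table(wheat$variety)          # group sizes n_i
cc  <- c(1, -1, 0, 0)                # contrast coefficients for mu_1 - mu_2
est <- sum(cc * means)
se  <- sqrt(mse * sum(cc^2 / ni))
est + c(-1, 1) * qt(0.975, n - r) * se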

To compare the effects of different levels, usually more than one test or confidence interval is used \(\implies\) multiple comparisons

  1. Adjust the confidence/significance levels using the Bonferroni method

  2. Tukey’s Honest Significant Differences method

    Provides all pairwise comparisons of factor level means:

    \[(\overline{Y}_{i\bul}-\overline{Y}_{j\bul}) \pm \frac{1}{\sqrt{2}} F^{-1}_{SR(r, n-r)}(\gamma)\sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}\]

    where \(SR\) is the Studentized Range distribution (quantile function qtukey() in R).

  3. Scheffé’s method

    Provides a family of CI for all possible contrasts:

    \[\sum_{i=1}^r c_i\overline{Y}_{i\bul} \pm \sqrt{(r-1) F^{-1}_{F(r-1, n-r)}(\gamma) MSE\sum_{i=1}^r \frac{c_i^2}{n_i}}\]
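A sketch of one such interval for an arbitrary, purely illustrative contrast, reusing means, ni, mse, n and r from the chunks above:

# Scheffé 95% interval for the contrast mu_2 - (mu_3 + mu_4)/2
cc  <- c(0, 1, -1/2, -1/2)
est <- sum(cc * means)
hw  <- sqrt((r - 1) * qf(0.95, r - 1, n - r) * mse * sum(cc^2 / ni))
est + c(-1, 1) * hw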

hsd <- TukeyHSD(fit2)   # all pairwise comparisons, 95% family-wise level
hsd
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = production ~ variety, data = wheat)

$variety
         diff         lwr       upr     p adj
2-1  4.353333   0.1556748  8.550992 0.0408629
3-1 -2.555000  -7.2052645  2.095264 0.4165311
4-1  1.445000  -3.2052645  6.095264 0.8072531
3-2 -6.908333 -11.3830524 -2.433614 0.0023558
4-2 -2.908333  -7.3830524  1.566386 0.2800552
4-3  4.000000  -0.9018091  8.901809 0.1301695
plot(hsd)   # intervals that cross zero indicate non-significant differences

4.2 Two-factor ANOVA

Let us now consider two factors, \(A\) and \(B\), with \(r\) and \(c\) levels, respectively.

Data:

\(Y_{ijk}\) – value of the response variable of the \(k^{th}\) observation for the \(i^{th}\) level of factor A and the \(j^{th}\) level of factor B, with \(i=1,\ldots, r\), \(j=1,\ldots, c\) and \(k=1,\ldots, n_{ij}\)

The ANOVA model

\[Y_{ijk} = \mu + \alpha_i + \beta_j+ \gamma_{ij} + E_{ijk}\]

with \(E_{ijk} \sim N(0, \sigma^2)\), uncorrelated.

Note

The model is clearly overparametrized. We have \(1+r+c+rc\) coefficients to model \(rc\) cell means.

  1. Set \(\alpha_1=\beta_1=0\) and \(\gamma_{1j}=\gamma_{i1}=0,\ \forall i, j\)

    Writing \(E[Y_{ijk}]=\mu_{i,j}\) we have:

    \(\mu=\mu_{1,1}\)

    \(\alpha_i=\mu_{i,1}-\mu_{1,1}\)

    \(\beta_j=\mu_{1,j}-\mu_{1,1}\)

    \(\gamma_{ij}=\mu_{i,j}-\mu_{i,1}-\mu_{1,j}+\mu_{1,1}\)

    Again, the cell means model.

  2. Set:

    \(\sum_{i=1}^r n_{i\bul}\alpha_i=0\)

    \(\sum_{j=1}^c n_{\bul j}\beta_j=0\)

    \(\sum_{i=1}^r n_{ij}\gamma_{ij}=0,\ \forall j\)

    \(\sum_{j=1}^c n_{ij}\gamma_{ij}=0,\ \forall i\)

    Now, \(\mu=\frac{\sum_{i=1}^r \sum_{j=1}^c n_{ij}\mu_{ij}}{n}\) and the remaining parameters are deviations from this overall mean: the factor effects model.

Similarly to the previous case we have \(\hat{\mu}_{ij}=\overline{Y}_{ij\bul}\) and

\(\hat{\mu} =\overline{Y}_{\bul\bul\bul}\)

\(\hat{\alpha}_i = \overline{Y}_{i\bul\bul} - \overline{Y}_{\bul\bul\bul}\)

\(\hat{\beta}_j = \overline{Y}_{\bul j \bul} - \overline{Y}_{\bul\bul\bul}\)

\(\hat{\gamma}_{ij} = \overline{Y}_{ij\bul} -\overline{Y}_{i\bul\bul} -\overline{Y}_{\bul j \bul} + \overline{Y}_{\bul\bul\bul}\)

Q3. Is there significant interaction between the factors?

No: the data can be analysed separately for each factor.

Yes: the analysis of the factors’ main effects loses importance.

For balanced data, all previous tools apply but, in the unbalanced case, some results become dependent on the order of the factors or, more precisely, on the encoding used. This requires:

  1. further considerations on the choice of encodings;

  2. decompositions of the SST other than the sequential one (Type I ANOVA), which lead to Type II and Type III F-tests (a sketch follows this list).
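A minimal two-factor sketch, assuming a data frame dat with response y and factors A and B (all hypothetical names); car::Anova() is one standard route to the non-sequential tests:

# Two-factor ANOVA with interaction: A * B expands to A + B + A:B
fit <- aov(y ~ A * B, data = dat)
summary(fit)                           # sequential (Type I) sums of squares

# Visual check for interaction (Q3): non-parallel profiles suggest gamma_ij != 0
interaction.plot(dat$A, dat$B, dat$y)

# For unbalanced data (requires the car package):
# car::Anova(fit, type = "II")
# car::Anova(fit, type = "III")        # with sum-to-zero contrasts set beforehand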
