\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \def\bul{\small \bullet} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)
Linear models with only qualitative covariates are called ANOVA models.
In this context:
covariates \(\equiv\) factors
different values of a covariate \(\equiv\) levels of the factor
regression coefficients \(\equiv\) effects
Main inferential questions
Q1. Does a factor influence the response variable?
Q2. If so, how do different levels affect that influence?
Q3. Is there an interaction between factors?
Data:
\(Y_{ij}\) – value of the response variable of the \(j^{th}\) observation for the \(i^{th}\) factor level, with \(i=1,\ldots, r\) and \(j=1,\ldots, n_i\)
Note
If the \(n_i\) are all equal the design is said to be balanced.
Four varieties of wheat were planted in several plots, all with the same area, according to a completely randomized experimental design. The wheat production per plot (in kg) was recorded.
| Wheat variety | Wheat production (kg) |
|---|---|
| 1 | 34.3 35.7 37.8 36.9 33.2 |
| 2 | 40.4 38.0 38.3 40.6 38.8 43.5 |
| 3 | 30.5 32.0 33.4 36.2 |
| 4 | 33.1 38.4 35.7 40.9 |
  variety    n   mean
1       1    5 35.580
2       2    6 39.933
3       3    4 33.025
4       4    4 37.025
5    <NA>   19 36.721
The last row, with variety <NA>, refers to the overall sample.
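The summary above can be reproduced directly. A sketch, reconstructing a `wheat` data frame from the table (the original data object is not shown):

```r
# Reconstruct the data from the table above
wheat <- data.frame(
  variety = factor(rep(1:4, times = c(5, 6, 4, 4))),
  production = c(34.3, 35.7, 37.8, 36.9, 33.2,        # variety 1
                 40.4, 38.0, 38.3, 40.6, 38.8, 43.5,  # variety 2
                 30.5, 32.0, 33.4, 36.2,              # variety 3
                 33.1, 38.4, 35.7, 40.9)              # variety 4
)
# Per-level counts and means; the <NA> row of the summary is the overall sample
aggregate(production ~ variety, data = wheat, mean)
c(n = nrow(wheat), mean = mean(wheat$production))
```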
The ANOVA model
\[Y_{ij} = \mu + \alpha_i + E_{ij}\]
for \(i=1,\ldots, r\) and \(j=1,\ldots, n_i\) with \(E_{ij} \sim N(0, \sigma^2)\), uncorrelated.
Note
This is not in the form of a general linear model, but we will get there.
Problem
This is a non-identifiable model!
\[E[Y_{ij}] = \mu + \alpha_i = (\mu + c) + (\alpha_i - c),\ \forall c\in \mathbb{R}\]
\(\leadsto\) we get a less-than-full-rank linear model
\(\leadsto\) there are non-estimable parametric functions
A solution: impose an identifiability restriction on the parameters.
Possible choices
The response surface:
\[E[Y_{ij}] = \mu + \alpha_i\]
Drop \(\mu\)
The lack of an intercept would invalidate valuable results (\(R^2\), the overall F-test, \(\ldots\))
Set \(\alpha_1=0\)
Then, \(\mu=E[Y_{1j}]\) and \(\alpha_i=E[Y_{ij}] - E[Y_{1j}]\), for \(i>1\).
| Level | \(x_2\) | \(x_3\) | \(x_4\) |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
\[\begin{align*} E[Y_{ij}] & = \mu + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 =\\[0.25cm] & = \begin{cases} \mu, & i=1\\ \mu+\alpha_2, & i=2\\ \mu + \alpha_3, & i=3\\ \mu + \alpha_4, & i=4 \end{cases} \end{align*}\]
Adopting this restriction, the model can be simply written as
\[Y_{ij} = \mu_i + E_{ij}\]
with \(E[Y_{ij}] = \mu_i = \mu + \alpha_i\) (\(\alpha_1=0\))
This is called the cell means model.
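This 0/1 encoding is exactly R's default treatment contrasts; a quick check:

```r
# Default dummy coding for a 4-level factor: level 1 is the reference (alpha_1 = 0)
contr.treatment(4)
```

Each row of this matrix matches a line of the encoding table above.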
Set \(\sum_{i=1}^r n_i\alpha_i =0\).
Then \(\mu=\frac{\sum_{i=1}^r n_i\mu_i}{n}\), an overall mean response, and \(\alpha_i=\mu_i - \mu\).
This is called the factor effects model that requires a different encoding.
| Level | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | \(-\frac{n_1}{n_4}\) | \(-\frac{n_2}{n_4}\) | \(-\frac{n_3}{n_4}\) |
\[\begin{align*} E[Y_{ij}] & = \mu + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 =\\[0.25cm] & = \begin{cases} \mu+\alpha_1, & i=1\\ \mu+\alpha_2, & i=2\\ \mu + \alpha_3, & i=3\\ \mu - \frac{1}{n_4}(n_1\alpha_1+n_2\alpha_2+n_3\alpha_3), & i=4 \end{cases} \end{align*}\]
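This encoding can be supplied to `lm` as a custom contrast matrix on the factor, so that the fitted coefficients become the weighted overall mean and the \(\hat{\alpha}_i\). A sketch, reconstructing the `wheat` data frame from the example (note that the Call line of such a fit still reads `production ~ variety`, since the contrasts are attached to the factor):

```r
wheat <- data.frame(
  variety = factor(rep(1:4, times = c(5, 6, 4, 4))),
  production = c(34.3, 35.7, 37.8, 36.9, 33.2,
                 40.4, 38.0, 38.3, 40.6, 38.8, 43.5,
                 30.5, 32.0, 33.4, 36.2,
                 33.1, 38.4, 35.7, 40.9)
)
n_i <- table(wheat$variety)
# Encoding table above: identity rows for levels 1-3, last row -n_i/n_4
C <- rbind(diag(3), -as.numeric(n_i[1:3]) / as.numeric(n_i[4]))
contrasts(wheat$variety) <- C
fit <- lm(production ~ variety, data = wheat)
coef(fit)  # intercept = weighted overall mean; slopes = alpha_1, alpha_2, alpha_3
```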
Call:
lm(formula = production ~ variety - 1, data = wheat)
Residuals:
Min 1Q Median 3Q Max
-3.925 -1.479 0.120 1.347 3.875
Coefficients:
Estimate Std. Error t value Pr(>|t|)
variety1 35.5800 1.0756 33.08 1.96e-15 ***
variety2 39.9333 0.9819 40.67 < 2e-16 ***
variety3 33.0250 1.2026 27.46 3.06e-14 ***
variety4 37.0250 1.2026 30.79 5.67e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared: 0.9966, Adjusted R-squared: 0.9957
F-statistic: 1113 on 4 and 15 DF, p-value: < 2.2e-16
Call:
lm(formula = production ~ variety, data = wheat)
Residuals:
Min 1Q Median 3Q Max
-3.925 -1.479 0.120 1.347 3.875
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.580 1.076 33.078 1.96e-15 ***
variety2 4.353 1.456 2.989 0.00918 **
variety3 -2.555 1.613 -1.584 0.13415
variety4 1.445 1.613 0.896 0.38462
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared: 0.5872, Adjusted R-squared: 0.5046
F-statistic: 7.112 on 3 and 15 DF, p-value: 0.003391
Call:
lm(formula = production ~ variety, data = wheat)
Residuals:
Min 1Q Median 3Q Max
-3.925 -1.479 0.120 1.347 3.875
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.7211 0.5518 66.548 < 2e-16 ***
variety1 -1.1411 0.9233 -1.236 0.23554
variety2 3.2123 0.8122 3.955 0.00127 **
variety3 -3.6961 1.0685 -3.459 0.00351 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.405 on 15 degrees of freedom
Multiple R-squared: 0.5872, Adjusted R-squared: 0.5046
F-statistic: 7.112 on 3 and 15 DF, p-value: 0.003391
Inferences in the ANOVA model
The OLS estimators are extremely simple.
\[\begin{align*} Q(\bs{\mu}) & = \sum_{i=1}^r\sum_{j=1}^{n_i}(Y_{ij}- \mu_i)^2 =\\[0.25cm] & = Q_1 + \ldots + Q_r =\\[0.25cm] & = \sum_{j=1}^{n_1}(Y_{1j}- \mu_1)^2 + \ldots + \sum_{j=1}^{n_r}(Y_{rj}- \mu_r)^2 \end{align*}\]
\(\frac{dQ_i}{d\mu_i}=0 \iff \mu_i = \dfrac{\sum_{j=1}^{n_i}Y_{ij}}{n_i}=\overline{Y}_{i\bul}\)
The estimators are:
\(\hat{\mu}_i =\overline{Y}_{i\bul}\)
\(\hat{\mu} =\dfrac{\sum_{i=1}^{r}n_i\overline{Y}_{i\bul}}{n}=\overline{Y}_{\bul\bul}\)
\(\hat{\alpha}_i = \overline{Y}_{i\bul} - \overline{Y}_{\bul\bul}\)
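For the wheat data, these formulas reproduce the values seen in the regression outputs:

\[\hat{\mu} = \frac{5(35.580) + 6(39.933) + 4(33.025) + 4(37.025)}{19} \approx 36.721, \qquad \hat{\alpha}_2 = 39.933 - 36.721 = 3.212.\]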
For \(\sigma^2\), correcting the bias of the ML estimator, we use
\(\hat{\sigma}^2 = MSE = \dfrac{SSE}{n-r}=\dfrac{\sum_{i=1}^r\sum_{j=1}^{n_i}\left(Y_{ij}- \overline{Y}_{i\bul}\right)^2}{n-r}\)
Q1. Does a factor influence the response variable?
We need to test
\(H_0: \mu_1=\ldots=\mu_r=\mu\) vs. \(H_1: \exists i,j : \mu_i\neq\mu_j\)
or
\(H_0: \alpha_1=\ldots=\alpha_r=0\) vs. \(H_1: \exists i : \alpha_i\neq 0\)
As we have seen in the example, testing these hypotheses is equivalent to testing the significance of the corresponding regression model, regardless of the chosen encoding.
Q2. How do different levels of the factor influence the response variable?
To handle this question we need to analyse the level means \(\mu_i\) or linear combinations of them, \(\sum_{i=1}^r c_i\mu_i\).
We have,
\[\overline{Y}_{i\bul} \sim N\left(\mu_i, \frac{\sigma^2}{n_i}\right)\text{ independent of }\frac{SSE}{\sigma^2}\sim \chi^2_{(n-r)}\]
And so,
\[\frac{\overline{Y}_{i\bul} - \mu_i}{\sqrt{\frac{MSE}{n_i}}}\sim t_{(n-r)}\]
and, more generally,
\[\frac{\sum_{i=1}^r c_i\overline{Y}_{i\bul} - \sum_{i=1}^r c_i\mu_i}{\sqrt{MSE\sum_{i=1}^r \frac{c_i^2}{n_i}}}\sim t_{(n-r)}\]
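As an illustration of the \(t\) pivot above, a 95% confidence interval for \(\mu_2-\mu_1\) (a sketch, reconstructing the `wheat` data frame from the example):

```r
wheat <- data.frame(
  variety = factor(rep(1:4, times = c(5, 6, 4, 4))),
  production = c(34.3, 35.7, 37.8, 36.9, 33.2,
                 40.4, 38.0, 38.3, 40.6, 38.8, 43.5,
                 30.5, 32.0, 33.4, 36.2,
                 33.1, 38.4, 35.7, 40.9)
)
n_i <- table(wheat$variety)
m   <- tapply(wheat$production, wheat$variety, mean)
mse <- summary(lm(production ~ variety, data = wheat))$sigma^2  # MSE = SSE/(n-r)
# 95% CI for mu_2 - mu_1, i.e. the contrast c = (-1, 1, 0, 0); df = n - r = 15
est <- m[["2"]] - m[["1"]]
se  <- sqrt(mse * (1 / n_i[["1"]] + 1 / n_i[["2"]]))
est + c(-1, 1) * qt(0.975, df = 15) * se
```

The standard error (about 1.456) matches the `variety2` line of the treatment-coded fit above.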
To compare the effects of different levels, usually more than one test or confidence interval is used \(\implies\) multiple comparisons
Adjust the confidence/significance levels using the Bonferroni method
Tukey’s Honest Significant Differences method
Provides all pairwise comparisons of factor level means:
\[(\overline{Y}_{i\bul}-\overline{Y}_{j\bul}) \pm \frac{1}{\sqrt{2}} F^{-1}_{SR(r, n-r)}(\gamma)\sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}\]
where \(SR\) is the Studentized Range distribution.
Scheffé’s method
Provides a family of CIs for all possible contrasts:
\[\sum_{i=1}^r c_i\overline{Y}_{i\bul} \pm \sqrt{(r-1) F^{-1}_{F(r-1, n-r)}(\gamma) MSE\sum_{i=1}^r \frac{c_i^2}{n_i}}\]
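In R, Tukey's intervals are obtained by applying `TukeyHSD` to an `aov` fit (a sketch, reconstructing the `wheat` data frame from the example):

```r
wheat <- data.frame(
  variety = factor(rep(1:4, times = c(5, 6, 4, 4))),
  production = c(34.3, 35.7, 37.8, 36.9, 33.2,
                 40.4, 38.0, 38.3, 40.6, 38.8, 43.5,
                 30.5, 32.0, 33.4, 36.2,
                 33.1, 38.4, 35.7, 40.9)
)
# All pairwise comparisons of level means, with family-wise 95% confidence
TukeyHSD(aov(production ~ variety, data = wheat), conf.level = 0.95)
```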
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = production ~ variety, data = wheat)
$variety
diff lwr upr p adj
2-1 4.353333 0.1556748 8.550992 0.0408629
3-1 -2.555000 -7.2052645 2.095264 0.4165311
4-1 1.445000 -3.2052645 6.095264 0.8072531
3-2 -6.908333 -11.3830524 -2.433614 0.0023558
4-2 -2.908333 -7.3830524 1.566386 0.2800552
4-3 4.000000 -0.9018091 8.901809 0.1301695
Let us now consider two factors, \(A\) and \(B\), with \(r\) and \(c\) levels, respectively.
Data:
\(Y_{ijk}\) – value of the response variable of the \(k^{th}\) observation for the \(i^{th}\) level of factor A and the \(j^{th}\) level of factor B, with \(i=1,\ldots, r\), \(j=1,\ldots, c\) and \(k=1,\ldots, n_{ij}\)
The ANOVA model
\[Y_{ijk} = \mu + \alpha_i + \beta_j+ \gamma_{ij} + E_{ijk}\]
with \(E_{ijk} \sim N(0, \sigma^2)\), uncorrelated.
Note
The model is clearly overparametrized. We have \(1+r+c+rc\) coefficients to model \(rc\) cell means.
Set \(\alpha_1=\beta_1=0\) and \(\gamma_{1j}=\gamma_{i1}=0,\ \forall i, j\)
Writing \(E[Y_{ijk}]=\mu_{i,j}\) we have:
\(\mu=\mu_{1,1}\)
\(\alpha_i=\mu_{i,1}-\mu_{1,1}\)
\(\beta_j=\mu_{1,j}-\mu_{1,1}\)
\(\gamma_{ij}=\mu_{i,j}-\mu_{i,1}-\mu_{1,j}+\mu_{1,1}\)
Again, the cell means model.
Set:
\(\sum_{i=1}^r n_{i\bul}\alpha_i=0\)
\(\sum_{j=1}^c n_{\bul j}\beta_j=0\)
\(\sum_{i=1}^r n_{ij}\gamma_{ij}=0,\ \forall j\)
\(\sum_{j=1}^c n_{ij}\gamma_{ij}=0,\ \forall i\)
Now, \(\mu=\frac{\sum_{i=1}^r \sum_{j=1}^c n_{ij}\mu_{ij}}{n}\) and the remaining parameters are deviations from this overall mean. The factor effects model.
Similarly to the previous case we have \(\hat{\mu}_{ij}=\overline{Y}_{ij\bul}\) and
\(\hat{\mu} =\overline{Y}_{\bul\bul\bul}\)
\(\hat{\alpha}_i = \overline{Y}_{i\bul\bul} - \overline{Y}_{\bul\bul\bul}\)
\(\hat{\beta}_j = \overline{Y}_{\bul j \bul} - \overline{Y}_{\bul\bul\bul}\)
\(\hat{\gamma}_{ij} = \overline{Y}_{ij\bul} -\overline{Y}_{i\bul\bul} -\overline{Y}_{\bul j \bul} + \overline{Y}_{\bul\bul\bul}\)
Q3. Is there significant interaction between the factors?
No
Data can be analysed separately for each factor.
Yes
The analysis of the factors’ main effects loses importance.
For balanced data, all previous tools apply but, in the unbalanced case, some results become dependent on the order of the factors or, more precisely, on the encoding used. This requires:
further considerations on the choice of encodings;
decompositions of the SST other than the sequential one (Type I ANOVA); these lead to Type II and Type III F-tests.
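The order dependence can be seen in a minimal sketch with synthetic, unbalanced two-factor data (illustrative only, not from the example):

```r
set.seed(1)
# Unbalanced layout: cell counts 5, 3, 2, 6 make the factors non-orthogonal
d <- data.frame(
  A = factor(rep(c(1, 1, 2, 2), times = c(5, 3, 2, 6))),
  B = factor(rep(c(1, 2, 1, 2), times = c(5, 3, 2, 6)))
)
d$y <- as.numeric(d$A) + as.numeric(d$B) + rnorm(16)
# Sequential (Type I) sums of squares change with the order of the factors...
anova(lm(y ~ A + B, data = d))["A", "Sum Sq"]
anova(lm(y ~ B + A, data = d))["A", "Sum Sq"]
# ...while the marginal F-tests reported by drop1 do not depend on the order
drop1(lm(y ~ A + B, data = d), test = "F")
```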