Linear regression
\(\def\bs#1{\boldsymbol{#1}}
\def\b#1{\mathbf{#1}}
\DeclareMathOperator{\diag}{diag}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\r}{r}
\DeclareMathOperator{\det}{det}\)
The multiple linear regression model
Observational data or data from a completely randomized design experiment:
\((y_i, \b{x}_i)_{i=1,\ldots,n}\) – all quantitative (for now...)
where \(\b{x}=(x_1,\ldots,x_k)\) – \(k\) covariates
The full rank Gauss-Markov model
\[\underset{n\times 1}{\b{y}} = \underset{n\times p}{\b{X}}\quad \underset{p\times 1}{\bs{\beta}}+\underset{n\times 1}{\b{e}}\]
with \(\b{e}\sim N_n(\b{0}, \sigma^2\b{I})\) and \(r(\b{X})=p\)
In an equivalent way,
\[\b{y}\sim N_n(\b{X}\bs{\beta}, \sigma^2\b{I})\]
Usually, the entry point to regression models is the 1st order model that simply includes all the covariates in the design matrix. If required, the inclusion of higher order terms can be considered:
| Order | Type of terms | Terms |
|---|---|---|
| 1st order | (1)-powers | \(x_1,\ldots, x_{p-1}\) |
| 2nd order | (2)-powers | \(x_1^2,\ldots, x_{p-1}^2\) |
| | (1,1)-interactions | \(x_1x_2,\ldots, x_1x_{p-1}, x_2x_3,\ldots, x_2x_{p-1},\ldots\) |
| 3rd order | (3)-powers | \(x_1^3,\ldots, x_{p-1}^3\) |
| | (1,2)-interactions | \(x_1x_2^2,\ldots, x_1x_{p-1}^2, x_2x_3^2,\ldots, x_2x_{p-1}^2,\ldots\) |
| | (2,1)-interactions | \(x_1^2x_2,\ldots, x_1^2x_{p-1}, x_2^2x_3,\ldots, x_2^2x_{p-1},\ldots\) |
| … | … | … |
Note
The usual interpretation of \(\beta_i\), \(i=1,\ldots, p-1\), as the variation in \(E[y\mid \b{x}]\) due to a unit increase in \(x_i\), keeping all other covariates fixed, that is \[E[y\mid x_1,\ldots, x_i=x+1,\ldots, x_{p-1}]-E[y\mid x_1,\ldots, x_i=x,\ldots, x_{p-1}]\] is only valid for the 1st order model.
For any other regression model, all interpretations come from its particular response surface:
\[E[y\mid \b{x}] = \b{x}'\bs{\beta}\]
Estimation and prediction
OLS estimation
\[\text{Minimize}\ RSS(\bs{\beta}) = (\b{y}-\b{X}\bs{\beta})'(\b{y}-\b{X}\bs{\beta})\]
\(\frac{dRSS}{d\bs{\beta}}=0 \iff -2\b{X}'\b{y}+ 2\b{X}'\b{X}\bs{\beta}=0 \iff \b{X}'\b{X}\bs{\beta} = \b{X}'\b{y}\)
The matrix \(\b{X}'\b{X}\) is:
- symmetric
- full rank
- invertible
- positive definite

Therefore \[\hat{\bs{\beta}}=(\b{X}'\b{X})^{-1}\b{X}'\b{y}\] is the unique solution of the normal equations.
The Hessian matrix is \[\frac{d^2RSS}{d\bs{\beta}d\bs{\beta}'}=2\b{X}'\b{X}\] which is positive definite and, so, the found solution is indeed a minimum.
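As a quick check, the normal equations can be solved directly in R; the following is a minimal sketch using simulated data (all object names and values are arbitrary):
set.seed(1)
n <- 50
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))   # design matrix with intercept column
beta <- c(2, 1, -0.5)
y <- drop(X %*% beta + rnorm(n))
# solve the normal equations X'X b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat
coef(lm(y ~ X[, -1]))   # same estimates as lm()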
Properties of the estimator
\[\hat{\bs{\beta}}=(\b{X}'\b{X})^{-1}\b{X}'\b{y}\]
- Linear estimator
- Unbiased estimator of \(\bs{\beta}\)
- \(Var[\hat{\bs{\beta}}] = \sigma^2(\b{X}'\b{X})^{-1}\)
ML estimation
Maximizing \[\log \mathcal{L}(\bs{\beta},\sigma^2)=-\frac{n}{2}\log \sigma^2-\frac{1}{2\sigma^2}(\b{y}-\b{X}\bs{\beta})'(\b{y}-\b{X}\bs{\beta}) + C\] we get:
- the same estimator for \(\bs{\beta}\)
- \(\hat{\sigma}^2_{ML}=\frac{(\b{y}-\b{X}\hat{\bs{\beta}})'(\b{y}-\b{X}\hat{\bs{\beta}})}{n}=\frac{SSE}{n}\)
Residuals
\(\hat{\b{y}}=\hat{E}[\b{y}\mid \b{x}]=\b{X}\hat{\bs{\beta}}\) – the fitted values
Using the expression of the estimator we can write:
\[\hat{\b{y}}=\b{X}(\b{X}'\b{X})^{-1}\b{X}'\b{y}= \b{H}\b{y}\]
\(\b{H}_{n\times n}=\b{X}(\b{X}'\b{X})^{-1}\b{X}'\) is called the hat matrix or the projection matrix
\(\b{H}\b{X}=\b{X} \iff (\b{I}-\b{H})\b{X}=\b{0}\)
\(\b{X}'\b{H}=\b{X}' \iff \b{X}'(\b{I}-\b{H})=\b{0}\)
\(\b{H}\) and \(\b{I}-\b{H}\) are symmetric and idempotent matrices
\(r(\b{H})=tr(\b{H})=p\) and \(r(\b{I}-\b{H})=tr(\b{I}-\b{H})=n-p\)
Similarly, for the residuals:
\[\b{e}=\b{y}-\hat{\b{y}}=(\b{I}-\b{H})\b{y}\]
Some properties of the residuals
- \(E[\b{e}]=\b{0}\)
- \(Var[\b{e}]=\sigma^2(\b{I}-\b{H})\)
- \(\hat{\b{y}}'\b{e}=0\)
- \(Cov[\hat{\b{y}},\b{e}] = \b{0}\)
The SSE in matrix form:
\[SSE=\b{e}'\b{e}=\b{y}'(\b{I}-\b{H})\b{y}\]
This implies that:
\[\frac{SSE}{\sigma^2}\sim \chi^2_{(n-p)}\]
Therefore, \(E[SSE]=\sigma^2(n-p)\) which shows that \(E[\hat{\sigma}^2_{ML}]=\frac{n-p}{n}\sigma^2\).
Hereafter, we will use the unbiased estimator \[\hat{\sigma}^2=\frac{SSE}{n-p}=MSE\]
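Continuing the simulated sketch above, the hat matrix, the residuals and the unbiased variance estimator can be computed directly from these formulas (this only illustrates the algebra; lm() does all of it internally):
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
e <- drop((diag(n) - H) %*% y)            # residuals (I - H) y
SSE <- sum(e^2)
MSE <- SSE / (n - ncol(X))                # unbiased estimator of sigma^2
fit <- lm(y ~ X[, -1])
all.equal(sqrt(MSE), sigma(fit))          # matches lm's residual standard error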
Distributions of the estimators
1. \(\hat{\bs{\beta}} \sim N_p\left(\bs{\beta}, \sigma^2(\b{X}'\b{X})^{-1}\right)\)
2. \(\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{\sigma^2}\sim \chi^2_{(p)}\)
3. \(\frac{SSE}{\sigma^2} = \frac{(n-p)\hat{\sigma}^2}{\sigma^2}\sim \chi^2_{(n-p)}\)
4. \(\hat{\bs{\beta}}\) and \(\hat{\sigma}^2\) are independent
The previous distributions provide a few more useful results.
\(\forall \b{c}\in\mathbb{R}^p :\)
\[\b{c}'\hat{\bs{\beta}}\sim N\left(\b{c}'\bs{\beta}, \sigma^2\b{c}'(\b{X}'\b{X})^{-1}\b{c}\right)\]
\[\implies \frac{\b{c}'\hat{\bs{\beta}} - \b{c}'\bs{\beta}}{\hat{\sigma}\sqrt{\b{c}'(\b{X}'\b{X})^{-1}\b{c}}}\sim t_{(n-p)}\]
Particular case 1
For \(\b{c}' = (0,\ldots, 0, 1, 0, \ldots, 0)\), with the 1 in the position corresponding to \(\beta_j\):
\[\frac{\hat{\beta}_j-\beta_j}{se(\hat{\beta}_j)}\sim t_{(n-p)}\]
with \(se(\hat{\beta}_j)=\hat{\sigma}\sqrt{(\b{X}'\b{X})^{-1}_{jj}}\)
Particular case 2
For \(y_0=E[y\mid \b{x}=\b{x}_0] = \b{x}'_0\bs{\beta}\):
\[\frac{\b{x}'_0\hat{\bs{\beta}}-\b{x}'_0\bs{\beta}}{\hat{\sigma}\sqrt{\b{x}'_0 (\b{X}'\b{X})^{-1} \b{x}_0}}\sim t_{(n-p)}\]
Prediction
The goal is to predict a new observation under the model, that is
\[y_u\sim N(\b{x}'_u\bs{\beta}, \sigma^2)\]
independent from observed data.
The estimator of its expected value, \(\hat{y}_u=\b{x}'_u\hat{\bs{\beta}}\), is now called a predictor.
Considering the prediction error, \(\hat{y}_u-y_u\), we have:
- \(E[\hat{y}_u-y_u]=0\)
- \(Var[\hat{y}_u-y_u]=\sigma^2\left(1+\b{x}'_u(\b{X}'\b{X})^{-1}\b{x}_u\right)\)
- \(\dfrac{\hat{y}_u-y_u}{se(\hat{y}_u-y_u)}\sim t_{(n-p)}\)
where \(se(\hat{y}_u-y_u)=\hat{\sigma}\sqrt{1+\b{x}'_u(\b{X}'\b{X})^{-1}\b{x}_u}\)
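In R, both the confidence interval for the mean response and the prediction interval for a new observation are available through predict(); a minimal sketch with simulated data (names and values are arbitrary):
set.seed(2)
d <- data.frame(x = runif(30, 0, 10))
d$y <- 5 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)
new <- data.frame(x = c(2.5, 7.5))
predict(fit, newdata = new, interval = "confidence", level = 0.95)   # for E[y | x0]
predict(fit, newdata = new, interval = "prediction", level = 0.95)   # for a new observation y_u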
Another result
From (2) to (4) we have:
\[\frac{n-p}{p}\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{SSE}\sim F_{(p, n-p)}\]
To analyse the demand for some product, the mean levels of education and income, together with a measure of urbanization, were recorded for each of 9 regions.
demand education income urbanization
1 167 11.2 31.9 42.2
2 174 10.6 13.2 48.6
3 161 10.6 28.7 42.6
4 162 10.4 26.1 39.0
5 141 9.3 30.1 34.7
6 175 10.8 8.5 44.5
7 164 10.7 24.3 39.1
8 174 10.0 18.6 40.1
9 186 12.0 20.4 45.9
model <- lm(demand ~ education, data=product)
summary(model)
Call:
lm(formula = demand ~ education, data = product)
Residuals:
Min 1Q Median 3Q Max
-9.06 -5.98 -2.17 5.22 15.53
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.9 43.7 0.66 0.530
education 13.0 4.1 3.17 0.016 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.66 on 7 degrees of freedom
Multiple R-squared: 0.59, Adjusted R-squared: 0.531
F-statistic: 10.1 on 1 and 7 DF, p-value: 0.0157
How to interpret the following results?
ci <- confint(model, level = 0.96)
ci
2 % 98 %
(Intercept) -81.05 138.7
education 2.69 23.3
These results were obtained from
\[\frac{\hat{\beta}_j-\beta_j}{se(\hat{\beta}_j)}\sim t_{(n-p)}\] with \(se(\hat{\beta}_j)=\hat{\sigma}\sqrt{(\b{X}'\b{X})^{-1}_{jj}}\).
We should be careful when using these results together: the estimators are not independent, so the stated confidence or significance levels are no longer guaranteed when several statements are made simultaneously.
The problem of simultaneous estimation
What can we say about the confidence level of the following statement?
\[(\beta_0, \beta_1)\in [-81.053, 138.749]\times [2.689, 23.336]\]
Using the distributions of the estimators, we have
\[T(\b{y})=\frac{n-p}{p}\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{SSE}\sim F_{(p, n-p)}\]
and so, the region
\[\left\{\bs{\beta}\in\mathbb{R}^p : T(\b{y})\leq F^{-1}_{F_{(p, n-p)}}(\gamma)\right\}\]
is a \(100\gamma\%\) joint confidence region for \(\bs{\beta}\).
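As a sketch of how this region can be used in practice, the following (made-up) helper checks whether a candidate vector \(\bs{\beta}_0\) belongs to the joint confidence region of the simple model fitted above; the candidate value is arbitrary:
in_joint_region <- function(fit, beta0, gamma = 0.96) {
  X <- model.matrix(fit)
  b <- coef(fit)
  n <- nrow(X); p <- length(b)
  SSE <- sum(residuals(fit)^2)
  # T(y) = (n-p)/p * (b - beta0)' X'X (b - beta0) / SSE
  Tstat <- (n - p) / p * drop(t(b - beta0) %*% crossprod(X) %*% (b - beta0)) / SSE
  Tstat <= qf(gamma, p, n - p)
}
in_joint_region(model, beta0 = c(30, 12))   # arbitrary candidate for (beta_0, beta_1)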
A possible correction
For any events \(A_1,\ldots,A_n\) we have \[P\left(\bigcap_{i=1}^n A_i\right)\geq 1-\sum_{i=1}^n P(\bar{A}_i).\]
Let \(C_i(\b{y})\) be a confidence interval for \(\beta_i\) with a \(\gamma_i\) confidence level.
Then, \[\begin{align*}
P\left(\bs{\beta}\in \bigcap_{i=0}^{p-1} C_i(\b{y})\right) & \geq 1-\sum_{i=0}^{p-1} P(\beta_i\not\in C_i(\b{y}) )=\\
& = 1-\sum_{i=0}^{p-1} (1-\gamma_i)=\\
& = 1-p(1-\gamma), \text{ if } \gamma_i=\gamma, \forall i
\end{align*}\]
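In R, the Bonferroni correction simply widens each individual interval; a minimal sketch for the two coefficients of the simple model above, targeting a joint level of at least 96%:
gamma <- 0.96
p <- length(coef(model))
# each interval at level 1 - (1 - gamma)/p, so the joint level is at least gamma
confint(model, level = 1 - (1 - gamma) / p)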
Application
Inferences for the mean response
From
\[\frac{\b{x}'_0\hat{\bs{\beta}}-\b{x}'_0\bs{\beta}}{se(\b{x}'_0\hat{\bs{\beta}})}\sim t_{(n-p)}\]
a \(100\gamma\%\) confidence interval is given by
\[\b{x}'_0\hat{\bs{\beta}}\pm F_{t_{(n-p)}}^{-1}\left(\frac{1+\gamma}{2}\right)\times se(\b{x}'_0\hat{\bs{\beta}})\]
As mentioned before, for simultaneous inferences a correction is required:
- the Bonferroni method for \(k\) intervals
- the Working-Hotelling method for all possible intervals:
\[\b{x}'_0\hat{\bs{\beta}}\pm W \times se(\b{x}'_0\hat{\bs{\beta}})\]
with \(W^2 = p F_{F_{(p, n-p)}}^{-1}\left(\gamma\right)\)
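A possible sketch of the Working-Hotelling bands for the simple model above (the grid of education values is arbitrary):
gamma <- 0.96
p <- length(coef(model)); n <- nrow(product)
W <- sqrt(p * qf(gamma, p, n - p))        # Working-Hotelling multiplier
grid <- data.frame(education = seq(9, 12.5, by = 0.5))
pr <- predict(model, newdata = grid, se.fit = TRUE)
cbind(grid, fit = pr$fit, lower = pr$fit - W * pr$se.fit, upper = pr$fit + W * pr$se.fit)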
96% confidence bands for the mean response
As before, corrections are also required when considering more than one prediction interval:
- the Bonferroni method
- the Scheffé method for \(k\) intervals:
\[\hat{y}_u\pm S \times se(\hat{y}_u-y_u)\]
with \(S^2 = k F_{F_{(k, n-p)}}^{-1}\left(\gamma\right)\)
96% prediction bands
R package: api2lm
Tests for the regression coefficients
The significance of the regression model
Let us consider the hypotheses
\[H_0: \forall i>0 : \beta_i = 0\] versus \[H_1: \exists i>0 : \beta_i \neq 0\]
Notes
- \(H_0\) means that the set of terms is not capable of explaining \(y\) under the model, that is, \[H_0: E[y\mid \b{X}] = \beta_0\]
- The intercept \(\beta_0\) is required. Otherwise, most of this section wouldn’t make sense!
The test procedure is based on the following decomposition of the deviations:
\[y_i-\bar{y} = (y_i-\hat{y}_i) + (\hat{y}_i-\bar{y})\]
We define:
- \(SST = \sum\left(y_i-\bar{y}\right)^2\)
- \(SSE = \sum\left(y_i-\hat{y}_i\right)^2\)
- \(SSR = \sum\left(\hat{y}_i-\bar{y}\right)^2\)
We have already seen that
\[SSE=\b{y}'(\b{I}-\b{H})\b{y}\]
and now:
\[SSR=\b{y}'\left(\b{H}-\frac{1}{n}\b{J}\right)\b{y}\]
\[SST=\b{y}'\left(\b{I}-\frac{1}{n}\b{J}\right)\b{y}\]
with \(\b{J}=\b{1}\b{1}'\)
In matrix notation, it is clear that:
- \(SST = SSE + SSR\)
- all 3 quantities are quadratic forms of a multivariate gaussian r. v. \(\implies\) non-central \(\chi^2\) distributions
- \(\frac{SSE}{\sigma^2}\sim \chi^2_{(n-p)}\) (already stated)
- Under \(H_0\): \(\frac{SST}{\sigma^2}\sim \chi^2_{(n-1)}\) (known result under i.i.d.)
- Under \(H_0\): \(\frac{SSR}{\sigma^2}\sim \chi^2_{(p-1)}\)
- \(SSR\) and \(SSE\) are independent (show that \((\b{I}-\b{H})\left(\b{H}-\frac{1}{n}\b{J}\right)=\b{0}\))
And so, to test the hypotheses we can use the overall F-test with the statistic:
\[F=\frac{\frac{SSR}{p-1}}{\frac{SSE}{n-p}} = \frac{MSR}{MSE}\stackrel{H_0}{\sim}F_{(p-1, n-p)}\]
\(H_0\) can be rejected if \(F_o > F^{-1}_{F_{(p-1, n-p)}}(1-\alpha)\)
Related to the previous test we already know
\[R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\] that is now called the multiple coefficient of determination.
Note that:
- \(SST\) is fixed and does not depend on the model
- \(SSR\) cannot decrease when the model is expanded
- \(\therefore\) \(R^2\) never decreases with the inclusion of terms in any model!

Because of this \(R^2\) inflation, we should prefer the
Adjusted coefficient of determination
\[R^2_a=1-\frac{MSE}{MST}=1-\frac{(n-1) SSE}{(n-p) SST}\]
Notes
- \(R^2_a\) penalizes the inclusion of terms
- The interpretation as a proportion of explained variation is lost
- \(R^2_a\) may be negative, unlike \(R^2\)
The general linear hypotheses
Many hypotheses on the regression coefficients can be written in the generic form
\[H_0: \b{C}\bs{\beta}= \b{m}\]
versus
\[H_1: \b{C}\bs{\beta}\neq \b{m}\]
where \(\b{C}_{(c\times p)}\) and \(\b{m}_{(c\times 1)}\) are constant, with \(r(\b{C})=c\leq p\), and the linear system \(\b{C}\bs{\beta}= \b{m}\) has at least one solution.
Examples:
- \(H_0: \beta_k=0\): \(\b{C}=(0, \ldots, 0, 1, 0, \ldots, 0)\), \(\b{m} = (0)\) and \(r(\b{C})=1\)
- \(H_0: \bs{\beta} = \b{0}_{(p)}\): \(\b{C}=\b{I}_{(p)}\), \(\b{m} = \b{0}_{(p)}\) and \(r(\b{C})=p\)
- \(H_0: \forall i>0 : \beta_i = 0\): \(\b{C}=(\b{0}_{(p-1)}\ \b{I}_{(p-1)})\), \(\b{m} = \b{0}_{(p-1)}\) and \(r(\b{C})=p-1\)
Under \(H_0\):
\[\b{C}\hat{\bs{\beta}} - \b{m} \sim N_c\left(\b{0}, \sigma^2\b{C}(\b{X}'\b{X})^{-1}\b{C}'\right)\]
With \(Q=\left(\b{C}\hat{\bs{\beta}} - \b{m}\right)'\left(\b{C}(\b{X}'\b{X})^{-1}\b{C}'\right)^{-1}\left(\b{C}\hat{\bs{\beta}} - \b{m}\right)\) we have
\(\dfrac{Q}{\sigma^2} \stackrel{H_0}{\sim} \chi^2_{(c)}\)
\(F=\dfrac{n-p}{c}\dfrac{Q}{SSE} \stackrel{H_0}{\sim} F_{(c, n-p)}\)
and \(H_0\) can be rejected if \(F_o > F^{-1}_{F_{(c, n-p)}}(1-\alpha)\).
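A sketch of this test computed directly from the formulas above (the helper name glh_test is made up); applied with \(\b{C}\) selecting the education coefficient of the simple model, it reproduces the squared t-statistic from summary():
glh_test <- function(fit, C, m) {
  b <- coef(fit)
  X <- model.matrix(fit)
  n <- nrow(X); p <- length(b); c_ <- nrow(C)
  SSE <- sum(residuals(fit)^2)
  # Q = (Cb - m)' (C (X'X)^{-1} C')^{-1} (Cb - m)
  Q <- drop(t(C %*% b - m) %*% solve(C %*% solve(crossprod(X)) %*% t(C)) %*% (C %*% b - m))
  F0 <- (n - p) / c_ * Q / SSE
  c(F = F0, p.value = pf(F0, c_, n - p, lower.tail = FALSE))
}
glh_test(model, C = matrix(c(0, 1), nrow = 1), m = 0)   # H0: beta_1 = 0 in demand ~ education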
This framework provides a way to compare a larger model (L) with a reduced one (R) defined by the restrictions.
A model R (reduced model) is said to be nested in a model L (larger model) if the set of terms of R is a subset or a linear subspace of the set of terms of L.
For nested models:
\[Q = SSE(R) - SSE(L)\]
and the F-statistic can be written in a more practical way:
\[F = \frac{SSE(R) - SSE(L)}{SSE(L)}\frac{df_L}{df_R-df_L} \stackrel{H_0}{\sim}F_{(df_R-df_L,\ df_L)}\]
Note
Application of this general framework to the initial examples of hypotheses would produce exactly the previous tests. For the univariate t-tests we would have \(F_o =T_o^2\).
The tests to compare nested models are usually called partial F-tests.
model <- lm(demand ~ ., data=product)
summary(model)
Call:
lm(formula = demand ~ ., data = product)
Residuals:
1 2 3 4 5 6 7 8 9
0.876 -0.972 -1.492 0.762 -4.612 -5.462 -2.129 11.653 1.377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.014 36.191 1.66 0.158
education 10.718 4.530 2.37 0.064 .
income -0.751 0.395 -1.90 0.116
urbanization 0.240 1.012 0.24 0.822
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.29 on 5 degrees of freedom
Multiple R-squared: 0.845, Adjusted R-squared: 0.753
F-statistic: 9.11 on 3 and 5 DF, p-value: 0.0181
reduced_model <- update(model, . ~ . - urbanization)
summary(reduced_model)
Call:
lm(formula = demand ~ education + income, data = product)
Residuals:
Min 1Q Median 3Q Max
-5.873 -2.730 0.065 1.111 11.481
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.021 31.114 2.03 0.089 .
education 11.517 2.777 4.15 0.006 **
income -0.816 0.261 -3.12 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.77 on 6 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.791
F-statistic: 16.2 on 2 and 6 DF, p-value: 0.00383
anova(reduced_model, model)
Analysis of Variance Table
Model 1: demand ~ education + income
Model 2: demand ~ education + income + urbanization
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6 200
2 5 198 1 2.22 0.06 0.82
null_model <- lm(demand ~ 1, data=product)
anova(null_model, reduced_model)
Analysis of Variance Table
Model 1: demand ~ 1
Model 2: demand ~ education + income
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 1279
2 6 200 2 1079 16.2 0.0038 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Extra sums of squares
Each regression model \(M\) leads to a particular decomposition of the SST: \[SST = SSR(M) + SSE(M)\]
For the null model we have SSE(\(\emptyset\)) = SST. The inclusion of terms will increase the SSR while decreasing the SSE.
To identify significant terms it may be useful to analyse their contributions to this variation transfer.
To do that we define partial or extra sums of squares as
\[SSR(L) - SSR(R) = SSE(R) - SSE(L)\]
The most useful extra sums of squares are the sequential ones:
\[SSR(x_{j+1}\mid x_1,\ldots,x_{j}) = SSR(x_1,\ldots,x_{j+1}) - SSR(x_1,\ldots,x_{j})\]
Another way of using the extra sums of squares is
\[SSR(x_1,\ldots,x_{p-1}) = SSR(x_1,\ldots,x_{j}) + SSR(x_{j+1}, \ldots, x_{p-1}\mid x_1,\ldots,x_{j})\]
This decomposition is not unique and can be applied sequentially. For example, in a model with 3 covariates we can have:
\(SSR(x_1,x_2,x_3)=SSR(x_1)+SSR(x_2\mid x_1)+SSR(x_3\mid x_1,x_2)\)
\(SSR(x_1,x_2,x_3)=SSR(x_2)+SSR(x_3\mid x_2)+SSR(x_1\mid x_2,x_3)\)
\(SSR(x_1,x_2,x_3)=SSR(x_1)+SSR(x_2, x_3\mid x_1)\)
The extra sums of squares are also useful to define the coefficients of partial determination.
For example,
\[R^2_{3\mid 1,2}=\frac{SSR(x_3\mid x_1, x_2)}{SSE(x_1, x_2)}\]
is the coefficient of partial determination between \(y\) and \(x_3\), given that \(x_1\) and \(x_2\) are in the model, and it is interpreted as the proportion of variation not explained by the reduced model that can be explained by the larger model.
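A sketch of computing this extra sum of squares and \(R^2_{3\mid 1,2}\) for the example data, using the fact that deviance() returns the SSE of an lm fit:
full    <- lm(demand ~ education + income + urbanization, data = product)
reduced <- lm(demand ~ education + income, data = product)
extra_ss <- deviance(reduced) - deviance(full)   # SSR(x3 | x1, x2) = SSE(R) - SSE(L)
extra_ss / deviance(reduced)                     # R^2_{3 | 1, 2}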
Finally, we can consider a sequence of partial F-tests (below, the output of anova() applied to the full model).
Analysis of Variance Table
Response: demand
Df Sum Sq Mean Sq F value Pr(>F)
education 1 754 754 19.06 0.0072 **
income 1 325 325 8.21 0.0352 *
urbanization 1 2 2 0.06 0.8221
Residuals 5 198 40
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The statistic here is a variation of previous ones:
\[F^*=\frac{(n-p)}{c}\frac{SSR(x_{j+1}, \ldots, x_{j+c}\mid x_1,\ldots,x_{j})}{SSE(x_1,\ldots,x_{p-1})}\stackrel{H_0}{\sim}F_{(c, n-p)}\]
Another view of the previous ANOVA table with added information:
m0 <- lm(demand ~ 1, data= product)
m1 <- update(m0, . ~ . + education)
m2 <- update(m1, . ~ . + income)
m3 <- update(m2, . ~ . + urbanization)
anova(m0, m1, m2, m3)
Analysis of Variance Table
Model 1: demand ~ 1
Model 2: demand ~ education
Model 3: demand ~ education + income
Model 4: demand ~ education + income + urbanization
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 1279
2 7 525 1 754 19.06 0.0072 **
3 6 200 1 325 8.21 0.0352 *
4 5 198 1 2 0.06 0.8221
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Categorical covariates
To consider the inclusion of categorical covariates in a regression model it is particularly useful to look at the response surface:
\[E[Y\mid \b{x}] = \b{x}'\bs{\beta}\]
Binary covariates (2 levels)
Suppose that \(p=3\) and \(x_2\) is binary (0/1). The response surface for the first-order model is:
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+\beta_1 x_1, & x_2=1\end{cases}
\end{align*}\]
\(\leadsto\) we are fitting 2 parallel response surfaces!
Note
With \(q\) binary covariates we would have \(2^q\) parallel response surfaces.
\(E[y\mid x_1, x_2=1] - E[y\mid x_1, x_2=0] = \beta_{2}\)
To allow non-parallel surfaces we need to include interactions:
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+(\beta_1 + \beta_3) x_1, & x_2=1\end{cases}
\end{align*}\]
\(E[y\mid x_1, x_2=1] - E[y\mid x_1, x_2=0] = \beta_{2} + \beta_3 x_1\)
More than 2 levels
Suppose now that \(x_2\) is a categorical variable or factor with 3 levels. If we code the levels as 0, 1 and 2 we have:
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+\beta_1 x_1, & x_2=1\\[0.25cm](\beta_0 + 2\beta_2)+\beta_1 x_1, & x_2=2\end{cases}
\end{align*}\]
Not only do we have parallel response surfaces, but they are also equally spaced and ordered. This is, in general, undesirable.
We could associate a binary variable to each level of the factor, \(x_2\rightsquigarrow (x_{2,0}, x_{2,1}, x_{2,2})\), but this would lead to a linear model of incomplete rank.
Since \(x_2\) has 3 levels we will use only 2 indicator or dummy variables, \(x_{2,1}=I(x_2=1)\) and \(x_{2,2}=I(x_2=2)\).
This is called a reference level encoding and the level for which \(\b{x}=\b{0}\) is called the reference level.
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_{2,1} x_{2,1} + \beta_{2,2} x_{2,2} =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_{2,1})+\beta_1 x_1, & x_2=1\\[0.25cm](\beta_0 + \beta_{2,2})+\beta_1 x_1, & x_2=2\end{cases}
\end{align*}\]
What can we do with these dummies?
\(E[y\mid x_1, x_2=i] - E[y\mid x_1, x_2=0] = \beta_{2,i},\ i=1,2\)
Again, the inclusion of interactions will remove the parallelism.
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_{2,1} x_{2,1} + \beta_{2,2} x_{2,2} + \\[0.25cm]
& + \beta_{3} x_1 x_{2,1} + \beta_4 x_1 x_{2,2} \\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_{2,1})+(\beta_1 + \beta_3) x_1, & x_2=1\\[0.25cm](\beta_0 + \beta_{2,2}) + (\beta_1 + \beta_4) x_1, & x_2=2\end{cases}
\end{align*}\]
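In R, this reference-level (treatment) encoding is what model.matrix() builds by default for a factor; a minimal sketch with made-up data shows the dummy and interaction columns:
d <- data.frame(x1 = c(1.2, 0.7, 2.1, 1.5, 0.3, 1.8),
                x2 = factor(c(0, 1, 2, 0, 1, 2)))   # 3-level factor, level 0 is the reference
model.matrix(~ x1 + x2, data = d)   # columns x21 and x22 are the dummy variables
model.matrix(~ x1 * x2, data = d)   # adds the interaction columns x1:x21 and x1:x22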
Notes
- Instead of the binary encoding in the dummy variables, other encodings (contrasts) are also used, providing different interpretations for the regression parameters.
- If all covariates are categorical we will use Analysis of Variance models (ANOVA).
Another application of categorical variables
Piecewise linear regression
Let \(x_1\) be a quantitative covariate and let \(x_2=I_{[c, +\infty[}(x_1)\).
Consider
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 (x_1-c)x_2 =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_1<c\\[0.25cm](\beta_0 - c \beta_2)+(\beta_1 + \beta_2) x_1, & x_1\geq c\end{cases}
\end{align*}\]
\(\leadsto\) continuous piecewise regression model
Consider
\[\begin{align*}
E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1-c)x_2 =\\[0.25cm]
& = \begin{cases}\beta_0+\beta_1 x_1, & x_1<c\\[0.25cm](\beta_0 + \beta_2 - c \beta_3)+(\beta_1 + \beta_3) x_1, & x_1\geq c\end{cases}
\end{align*}\]
\(\leadsto\) discontinuous piecewise regression model
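Both piecewise models can be fitted with lm(); a minimal sketch with simulated data and an arbitrary breakpoint c = 5:
set.seed(4)
cp <- 5
d <- data.frame(x1 = runif(100, 0, 10))
d$x2 <- as.numeric(d$x1 >= cp)                     # x2 = I_[c, +inf)(x1)
d$y <- 1 + 0.5 * d$x1 + 2 * (d$x1 - cp) * d$x2 + rnorm(100, sd = 0.3)
# continuous piecewise model: the slope changes at c, no jump
fit_cont <- lm(y ~ x1 + I((x1 - cp) * x2), data = d)
# discontinuous piecewise model: both a jump and a slope change at c
fit_disc <- lm(y ~ x1 + x2 + I((x1 - cp) * x2), data = d)
coef(fit_cont)
coef(fit_disc)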