2 Linear regression

\(\def\bs#1{\boldsymbol{#1}} \def\b#1{\mathbf{#1}} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\r}{r} \DeclareMathOperator{\det}{det}\)

2.1 The multiple linear regression model

Observational data or data from a completely randomized design experiment:

\((y_i, \b{x}_i)_{i=1,\ldots,n}\) – all quantitative (for now . . . )

where \(\b{x}=(x_1,\ldots,x_k)\) – \(k\) covariates

The full rank Gauss-Markov model

\[\underset{n\times 1}{\b{y}} = \underset{n\times p}{\b{X}}\quad \underset{p\times 1}{\bs{\beta}}+\underset{n\times 1}{\b{e}}\]

with \(\b{e}\sim N_n(\b{0}, \sigma^2\b{I})\) and \(r(\b{X})=p\)

In an equivalent way,

\[\b{y}\sim N_n(\b{X}\bs{\beta}, \sigma^2\b{I})\]

Hereafter:

Usually, the entry point to regression models is the 1st order model that simply includes all the covariates in the design matrix. If required, the inclusion of higher order terms can be considered:

  • 1st order: (1)-powers \(x_1,\ldots, x_{p-1}\)

  • 2nd order: (2)-powers \(x_1^2,\ldots, x_{p-1}^2\) and (1,1)-interactions \(x_1x_2,\ldots, x_1x_{p-1}, x_2x_3,\ldots, x_2x_{p-1},\ldots\)

  • 3rd order: (3)-powers \(x_1^3,\ldots, x_{p-1}^3\), (1,2)-interactions \(x_1x_2^2,\ldots, x_1x_{p-1}^2, x_2x_3^2,\ldots, x_2x_{p-1}^2,\ldots\) and (2,1)-interactions \(x_1^2x_2,\ldots, x_1^2x_{p-1}, x_2^2x_3,\ldots, x_2^2x_{p-1},\ldots\)

  • . . .

Note

The usual interpretation of \(\beta_i\), \(i=1,\ldots, p-1\), as the change in \(E[y\mid \b{x}]\) due to a unit increase in \(x_i\), keeping all other covariates fixed, that is \[E[y\mid x_1,\ldots, x_i=x+1,\ldots, x_{p-1}]-E[y\mid x_1,\ldots, x_i=x,\ldots, x_{p-1}]\] is only valid for the 1st order model.

For any other regression model, all interpretations come from its particular response surface:

\[E[y\mid \b{x}] = \b{x}'\bs{\beta}\]

2.2 Estimation and prediction

OLS estimation

\[\text{Minimize}\ RSS(\bs{\beta}) = (\b{y}-\b{X}\bs{\beta})'(\b{y}-\b{X}\bs{\beta})\]

\(\frac{dRSS}{d\bs{\beta}}=0 \iff -2\b{X}'\b{y}+ 2\b{X}'\b{X}\bs{\beta}=0 \iff \b{X}'\b{X}\bs{\beta} = \b{X}'\b{y}\)

The matrix \(\b{X}'\b{X}\) is:

  1. symmetric
  2. full rank
  3. invertible
  4. positive definite

Therefore \[\hat{\bs{\beta}}=(\b{X}'\b{X})^{-1}\b{X}'\b{y}\] is the unique solution of the normal equations.

The Hessian matrix is \[\frac{d^2RSS}{d\bs{\beta}d\bs{\beta}'}=2\b{X}'\b{X}\] which is positive definite and, so, the found solution is indeed a minimum.

Properties of the estimator

\[\hat{\bs{\beta}}=(\b{X}'\b{X})^{-1}\b{X}'\b{y}\]

  1. Linear estimator

  2. Unbiased estimator of \(\bs{\beta}\)

  3. \(Var[\hat{\bs{\beta}}] = \sigma^2(\b{X}'\b{X})^{-1}\)

ML estimation

Maximizing \[\log \mathcal{L}(\bs{\beta},\sigma^2)=-\frac{n}{2}\log \sigma^2-\frac{1}{2\sigma^2}(\b{y}-\b{X}\bs{\beta})'(\b{y}-\b{X}\bs{\beta}) + C\] we get:

  1. the same estimator for \(\bs{\beta}\)

  2. \(\hat{\sigma}^2_{ML}=\frac{(\b{y}-\b{X}\hat{\bs{\beta}})'(\b{y}-\b{X}\hat{\bs{\beta}})}{n}=\frac{SSE}{n}\)
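A minimal numerical check of the closed-form solution, using simulated data (all names here are illustrative, not part of the example analysed below):

set.seed(1)
n <- 50
X <- cbind(1, x1 = runif(n), x2 = runif(n))        # design matrix with intercept (p = 3)
beta <- c(2, -1, 3)
y <- drop(X %*% beta + rnorm(n, sd = 0.5))
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves the normal equations X'X b = X'y
cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ X[, -1])))
sum((y - X %*% beta_hat)^2) / n                    # SSE/n, the ML estimate of sigma^2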

Residuals

\(\hat{\b{y}}=\hat{E}[\b{y}\mid \b{x}]=\b{X}\hat{\bs{\beta}}\) – the fitted values

Using the expression of the estimator we can write:

\[\hat{\b{y}}=\b{X}(\b{X}'\b{X})^{-1}\b{X}'\b{y}= \b{H}\b{y}\]

\(\b{H}_{n\times n}=\b{X}(\b{X}'\b{X})^{-1}\b{X}'\) is called the hat matrix or the projection matrix

Theorem

  1. \(\b{H}\b{X}=\b{X} \iff (\b{I}-\b{H})\b{X}=\b{0}\)

  2. \(\b{X}'\b{H}=\b{X}' \iff \b{X}'(\b{I}-\b{H})=\b{0}\)

  3. \(\b{H}\) and \(\b{I}-\b{H}\) are symmetric and idempotent matrices

  4. \(r(\b{H})=tr(\b{H})=p\) and \(r(\b{I}-\b{H})=tr(\b{I}-\b{H})=n-p\)

Similarly, for the residuals:

\[\b{e}=\b{y}-\hat{\b{y}}=(\b{I}-\b{H})\b{y}\]

Some properties of the residuals

  1. \(E[\b{e}]=\b{0}\)

  2. \(Var[\b{e}]=\sigma^2(\b{I}-\b{H})\)

  3. \(\hat{\b{y}}'\b{e}=0\)

  4. \(Cov[\hat{\b{y}},\b{e}] = \b{0}\)

The SSE in matrix form:

\[SSE=\b{e}'\b{e}=\b{y}'(\b{I}-\b{H})\b{y}\]

Since \(\b{I}-\b{H}\) is symmetric and idempotent with rank \(n-p\), this implies that:

\[\frac{SSE}{\sigma^2}\sim \chi^2_{(n-p)}\]

Therefore, \(E[SSE]=\sigma^2(n-p)\) which shows that \(E[\hat{\sigma}^2_{ML}]=\frac{n-p}{n}\sigma^2\).

Hereafter, we will use the unbiased estimator \[\hat{\sigma}^2=\frac{SSE}{n-p}=MSE\]

Distributions of the estimators

  1. \(\hat{\bs{\beta}} \sim N_p\left(\bs{\beta}, \sigma^2(\b{X}'\b{X})^{-1}\right)\)

  2. \(\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{\sigma^2}\sim \chi^2_{(p)}\)

  3. \(\frac{SSE}{\sigma^2} = \frac{(n-p)\hat{\sigma}^2}{\sigma^2}\sim \chi^2_{(n-p)}\)

  4. \(\hat{\bs{\beta}}\) and \(\hat{\sigma}^2\) are independent

The previous distributions provide a few more useful results.

\(\forall \b{c}\in\mathbb{R}^p :\)

\[\b{c}'\hat{\bs{\beta}}\sim N\left(\b{c}'\bs{\beta}, \sigma^2\b{c}'(\b{X}'\b{X})^{-1}\b{c}\right)\]

\[\implies \frac{\b{c}'\hat{\bs{\beta}} - \b{c}'\bs{\beta}}{\hat{\sigma}\sqrt{\b{c}'(\b{X}'\b{X})^{-1}\b{c}}}\sim t_{(n-p)}\]

Particular case 1

For \(\b{c}' = (0,\ldots, 0, 1, 0, \ldots, 0)\):

\[\frac{\hat{\beta}_j-\beta_j}{se(\hat{\beta}_j)}\sim t_{(n-p)}\]

with \(se(\hat{\beta}_j)=\hat{\sigma}\sqrt{(\b{X}'\b{X})^{-1}_{jj}}\)

Particular case 2

For \(y_0=E[y\mid \b{x}=\b{x}_0] = \b{x}'_0\bs{\beta}\):

\[\frac{\b{x}'_0\hat{\bs{\beta}}-\b{x}'_0\bs{\beta}}{\hat{\sigma}\sqrt{\b{x}'_0 (\b{X}'\b{X})^{-1} \b{x}_0}}\sim t_{(n-p)}\]

Prediction

The goal is to predict a new observation under the model, that is

\[y_u\sim N(\b{x}'_u\bs{\beta}, \sigma^2)\]

independent from observed data.

The estimator of its expected value, \(\hat{y}_u=\b{x}'_u\hat{\bs{\beta}}\), is now called a predictor.

Considering the prediction error, \(\hat{y}_u-y_u\), we have:

  1. \(E[\hat{y}_u-y_u]=0\)

  2. \(Var[\hat{y}_u-y_u]=\sigma^2\left(1+\b{x}'_u(\b{X}'\b{X})^{-1}\b{x}_u\right)\)

  3. \(\dfrac{\hat{y}_u-y_u}{se(\hat{y}_u-y_u)}\sim t_{(n-p)}\)

    where \(se(\hat{y}_u-y_u)=\hat{\sigma}\sqrt{1+\b{x}'_u(\b{X}'\b{X})^{-1}\b{x}_u}\)
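The prediction-error formula can be checked numerically; a minimal sketch with simulated data (the names and the new point are illustrative):

set.seed(2)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(30)
fit <- lm(y ~ x, data = dat)
X <- model.matrix(fit); n <- nrow(X); p <- ncol(X)
xu <- c(1, 5)                                      # new covariate vector (with intercept)
yhat_u <- sum(coef(fit) * xu)
se_pred <- summary(fit)$sigma * sqrt(1 + drop(t(xu) %*% solve(crossprod(X)) %*% xu))
yhat_u + c(-1, 1) * qt(0.975, n - p) * se_pred     # 95% prediction interval, by hand
predict(fit, newdata = data.frame(x = 5), interval = "prediction")   # same interval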

Another result

From results 2 to 4 above we have:

\[\frac{n-p}{p}\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{SSE}\sim F_{(p, n-p)}\]

To analyse the demand for some product, the mean levels of education and income, together with a measure of urbanization, were recorded for each of 9 regions.

  demand education income urbanization
1    167      11.2   31.9         42.2
2    174      10.6   13.2         48.6
3    161      10.6   28.7         42.6
4    162      10.4   26.1         39.0
5    141       9.3   30.1         34.7
6    175      10.8    8.5         44.5
7    164      10.7   24.3         39.1
8    174      10.0   18.6         40.1
9    186      12.0   20.4         45.9

model <- lm(demand ~ education, data=product)
summary(model)

Call:
lm(formula = demand ~ education, data = product)

Residuals:
   Min     1Q Median     3Q    Max 
 -9.06  -5.98  -2.17   5.22  15.53 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)     28.9       43.7    0.66    0.530  
education       13.0        4.1    3.17    0.016 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.66 on 7 degrees of freedom
Multiple R-squared:  0.59,  Adjusted R-squared:  0.531 
F-statistic: 10.1 on 1 and 7 DF,  p-value: 0.0157

How to interpret the following results?

ci <- confint(model, level = 0.96)
ci
               2 %  98 %
(Intercept) -81.05 138.7
education     2.69  23.3

These results were obtained from

\[\frac{\hat{\beta}_j-\beta_j}{se(\hat{\beta}_j)}\sim t_{(n-p)}\] with \(se(\hat{\beta}_j)=\hat{\sigma}\sqrt{(\b{X}'\b{X})^{-1}_{jj}}\).
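A sketch reproducing these intervals from that formula, assuming the model object fitted above is still available:

est <- coef(model)
se  <- coef(summary(model))[, "Std. Error"]
tq  <- qt((1 + 0.96) / 2, df = df.residual(model))   # t quantile for a 96% interval
cbind(lower = est - tq * se, upper = est + tq * se)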

We should be careful when using those results simultaneously: the estimators are not independent, so the joint confidence (or significance) level is no longer the nominal one.

The problem of simultaneous estimation

What can we say about the confidence level of the following statement?

\[(\beta_0, \beta_1)\in [-81.053, 138.749]\times [2.689, 23.336]\]

Using the distributions of the estimators, we have

\[T(\b{y})=\frac{n-p}{p}\frac{(\hat{\bs{\beta}}-\bs{\beta})'\b{X}'\b{X}(\hat{\bs{\beta}}-\bs{\beta})}{SSE}\sim F_{(p, n-p)}\]

and so, the region

\[\left\{\bs{\beta}\in\mathbb{R}^p : T(\b{y})\leq F^{-1}_{F_{(p, n-p)}}(\gamma)\right\}\]

is a \(100\gamma\%\) joint confidence region for \(\bs{\beta}\).
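As a sketch, one can check whether a candidate value of \(\bs{\beta}\) lies inside this region for the model fitted above (the candidate point is purely illustrative):

beta_cand <- c(0, 10)                              # illustrative candidate for (beta_0, beta_1)
X <- model.matrix(model); n <- nrow(X); p <- ncol(X)
d <- coef(model) - beta_cand
T_y <- (n - p) / p * drop(t(d) %*% crossprod(X) %*% d) / sum(resid(model)^2)
T_y <= qf(0.96, p, n - p)                          # TRUE -> candidate inside the 96% joint region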

A possible correction

Boole’s (or Bonferroni’s) inequality: for any events \(A_1,\ldots,A_n\) we have \[P\left(\bigcap_{i=1}^n A_i\right)\geq 1-\sum_{i=1}^n P(\bar{A}_i).\]

Let \(C_i(\b{y})\) be a confidence interval for \(\beta_i\) with a \(\gamma_i\) confidence level.

Then, \[\begin{align*} P\left(\bs{\beta}\in \bigcap_{i=0}^{p-1} C_i(\b{y})\right) & \geq 1-\sum_{i=0}^{p-1} P(\beta_i\not\in C_i(\b{y}) )=\\ & = 1-\sum_{i=0}^{p-1} (1-\gamma_i)=\\ & = 1-p(1-\gamma), \text{ if } \gamma_i=\gamma, \forall i \end{align*}\]

Application
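As a sketch of how this correction applies to the earlier example: to guarantee a joint confidence level of at least 96% for the two coefficients of the fitted model, each individual interval is computed at level \(1-0.04/2=0.98\):

confint(model, level = 1 - 0.04/2)   # Bonferroni-corrected intervals, joint level >= 96%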

Inferences for the mean response

From

\[\frac{\b{x}'_0\hat{\bs{\beta}}-\b{x}'_0\bs{\beta}}{se(\b{x}'_0\hat{\bs{\beta}})}\sim t_{(n-p)}\]

a \(100\gamma\%\) confidence interval is given by

\[\b{x}'_0\hat{\bs{\beta}}\pm F_{t_{(n-p)}}^{-1}\left(\frac{1+\gamma}{2}\right)\times se(\b{x}'_0\hat{\bs{\beta}})\]

As referred before, for simultaneous inferences, a correction is required:

  1. the Bonferroni method for \(k\) intervals

  2. the Working-Hotelling method for all possible intervals:

    \[\b{x}'_0\hat{\bs{\beta}}\pm W \times se(\b{x}'_0\hat{\bs{\beta}})\]

    with \(W^2 = p F_{F_{(p, n-p)}}^{-1}\left(\gamma\right)\)

96% confidence bands for the mean response
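A sketch of how such bands can be computed with base R using the Working-Hotelling constant (the grid of education values is illustrative):

gamma <- 0.96
p <- length(coef(model)); n <- nobs(model)
W <- sqrt(p * qf(gamma, p, n - p))                 # Working-Hotelling constant
grid <- data.frame(education = seq(9, 12.5, by = 0.5))
fit <- predict(model, newdata = grid, se.fit = TRUE)
cbind(grid, fit = fit$fit,
      lower = fit$fit - W * fit$se.fit,
      upper = fit$fit + W * fit$se.fit)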

As before, corrections are also required when considering more than one prediction interval:

  1. the Bonferroni method

  2. the Scheffé method for \(k\) intervals

    \[\hat{y}_u\pm S \times se(\hat{y}_u-y_u)\]

    with \(S^2 = k F_{F_{(k, n-p)}}^{-1}\left(\gamma\right)\)

96% prediction bands

R package: api2lm

2.3 Tests for the regression coefficients

The significance of the regression model

Let us consider the hypotheses

\[H_0: \forall i>0 : \beta_i = 0\] versus \[H_1: \exists i>0 : \beta_i \neq 0\]

Notes

  1. \(H_0\) means that the set of terms is not capable of explaining \(y\) under the model, or equivalently

    \[H_0: E[y\mid \b{X}] = \beta_0\]

  2. The intercept \(\beta_0\) is required. Otherwise, most of this section wouldn’t make sense!

The test procedure is based on the following decomposition of the deviations:

\[y_i-\bar{y} = (y_i-\hat{y}_i) + (\hat{y}_i-\bar{y})\]

We define:

  • \(SST = \sum\left(y_i-\bar{y}\right)^2\)

  • \(SSE = \sum\left(y_i-\hat{y}_i\right)^2\)

  • \(SSR = \sum\left(\hat{y}_i-\bar{y}\right)^2\)

We have already seen that

\[SSE=\b{y}'(\b{I}-\b{H})\b{y}\]

and now:

\[SSR=\b{y}'\left(\b{H}-\frac{1}{n}\b{J}\right)\b{y}\]

\[SST=\b{y}'\left(\b{I}-\frac{1}{n}\b{J}\right)\b{y}\]

with \(\b{J}=\b{1}\b{1}'\)

In matrix notation, it is clear that:

  1. \(SST = SSE + SSR\)
  2. all 3 quantities are quadratic forms of a multivariate Gaussian r. v. \(\implies\) non-central \(\chi^2\) distributions

Theorem

  1. \(\frac{SSE}{\sigma^2}\sim \chi^2_{(n-p)}\) (already stated)

  2. Under \(H_0\): \(\frac{SST}{\sigma^2}\sim \chi^2_{(n-1)}\) (known result under i.i.d.)

  3. Under \(H_0\): \(\frac{SSR}{\sigma^2}\sim \chi^2_{(p-1)}\)

  4. \(SSR\) and \(SSE\) are independent (show that \((\b{I}-\b{H})\left(\b{H}-\frac{1}{n}\b{J}\right)=\b{0}\))

And so, to test the hypotheses we can use the overall F-test with the statistic:

\[F=\frac{\frac{SSR}{p-1}}{\frac{SSE}{n-p}} = \frac{MSR}{MSE}\stackrel{H_0}{\sim}F_{(p-1, n-p)}\]

\(H_0\) can be rejected if \(F_o > F^{-1}_{F_{(p-1, n-p)}}(1-\alpha)\)

Related to the previous test, we already know

\[R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\] that is now called the multiple coefficient of determination.

Note that:

  1. \(SST\) is fixed and does not depend on the model

  2. \(SSR\) cannot decrease when we expand a model

\(\therefore\) \(R^2\) never decreases with the inclusion of terms in any model!

Because of this \(R^2\) inflation, preference should be given to the

Adjusted coefficient of determination

\[R^2_a=1-\frac{MSE}{MST}=1-\frac{(n-1) SSE}{(n-p) SST}\]

Notes

  1. \(R^2_a\) penalizes the inclusion of terms

  2. The interpretation as a proportion of explained variation is lost

  3. \(R^2_a\) may be negative, unlike \(R^2\)
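As a quick check, both quantities can be computed directly from the sums of squares, assuming the product data frame and the simple model fitted earlier are available (the values should match the summary() output shown above):

y <- product$demand
SST <- sum((y - mean(y))^2)
SSE <- sum(resid(model)^2)
n <- length(y); p <- length(coef(model))
c(R2 = 1 - SSE / SST,
  R2_adj = 1 - (n - 1) / (n - p) * SSE / SST)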

The general linear hypotheses

Many hypotheses on the regression coefficients can be written in the generic form

\[H_0: \b{C}\bs{\beta}= \b{m}\]

versus

\[H_1: \b{C}\bs{\beta}\neq \b{m}\]

where \(\b{C}_{(c\times p)}\) and \(\b{m}_{(c\times 1)}\) are constant, with \(r(\b{C})=c\leq p\), and the linear system \(\b{C}\bs{\beta}= \b{m}\) has at least one solution.

  1. \(H_0: \beta_k=0\)

    \(\b{C}=(0, \ldots, 0, 1, 0, \ldots, 0)\), \(\b{m} = (0)\) and \(r(\b{C})=1\)

  2. \(H_0: \bs{\beta} = \b{0}_{(p)}\)

    \(\b{C}=\b{I}_{(p)}\), \(\b{m} = \b{0}_{(p)}\) and \(r(\b{C})=p\)

  3. \(H_0: \forall i>0 : \beta_i = 0\)

    \(\b{C}=(\b{0}_{(p-1)}\ \b{I}_{(p-1)})\), \(\b{m} = \b{0}_{(p-1)}\) and \(r(\b{C})=p-1\)

Under \(H_0\):

\[\b{C}\hat{\bs{\beta}} - \b{m} \sim N_c\left(\b{0}, \sigma^2\b{C}(\b{X}'\b{X})^{-1}\b{C}'\right)\]

With \(Q=\left(\b{C}\hat{\bs{\beta}} - \b{m}\right)'\left(\b{C}(\b{X}'\b{X})^{-1}\b{C}'\right)^{-1}\left(\b{C}\hat{\bs{\beta}} - \b{m}\right)\) we have

  1. \(\dfrac{Q}{\sigma^2} \stackrel{H_0}{\sim} \chi^2_{(c)}\)

  2. \(F=\dfrac{n-p}{c}\dfrac{Q}{SSE} \stackrel{H_0}{\sim} F_{(c, n-p)}\)

and \(H_0\) can be rejected if \(F_o > F^{-1}_{F_{(c, n-p)}}(1-\alpha)\).

This framework provides a way to compare a larger model (L) with a reduced one (R) defined by the restrictions.

A model R (reduced model) is said to be nested in a model L (larger model) if the set of terms of R is a subset of the set of terms of L (or spans a linear subspace of the space spanned by the terms of L).

For nested models:

\[Q = SSE(R) - SSE(L)\]

and the F-statistic can be written in a more practical way:

\[F = \frac{SSE(R) - SSE(L)}{SSE(L)}\frac{df_L}{df_R-df_L} \stackrel{H_0}{\sim}F_{(df_R-df_L,\, df_L)}\]

Note

Application of this general framework to the initial examples of hypotheses would produce exactly the previous tests. For the univariate t-tests we would have \(F_o =T_o^2\).
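A sketch verifying this numerically for \(H_0:\beta_1=0\) in the simple model fitted earlier (\(\b{C}=(0\ \ 1)\), \(\b{m}=0\)):

X <- model.matrix(model); n <- nrow(X); p <- ncol(X)
C <- matrix(c(0, 1), nrow = 1); m <- 0
d <- drop(C %*% coef(model)) - m
Q <- d^2 / drop(C %*% solve(crossprod(X)) %*% t(C))
F_o <- (n - p) / nrow(C) * Q / sum(resid(model)^2)
c(F_o = F_o, t_o_squared = coef(summary(model))["education", "t value"]^2)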

The tests to compare nested models are usually called partial F-tests.

model <- lm(demand ~ ., data=product)
summary(model)

Call:
lm(formula = demand ~ ., data = product)

Residuals:
     1      2      3      4      5      6      7      8      9 
 0.876 -0.972 -1.492  0.762 -4.612 -5.462 -2.129 11.653  1.377 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)    60.014     36.191    1.66    0.158  
education      10.718      4.530    2.37    0.064 .
income         -0.751      0.395   -1.90    0.116  
urbanization    0.240      1.012    0.24    0.822  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.29 on 5 degrees of freedom
Multiple R-squared:  0.845, Adjusted R-squared:  0.753 
F-statistic: 9.11 on 3 and 5 DF,  p-value: 0.0181

reduced_model <- update(model, . ~ . - urbanization)
summary(reduced_model)

Call:
lm(formula = demand ~ education + income, data = product)

Residuals:
   Min     1Q Median     3Q    Max 
-5.873 -2.730  0.065  1.111 11.481 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   63.021     31.114    2.03    0.089 . 
education     11.517      2.777    4.15    0.006 **
income        -0.816      0.261   -3.12    0.021 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.77 on 6 degrees of freedom
Multiple R-squared:  0.844, Adjusted R-squared:  0.791 
F-statistic: 16.2 on 2 and 6 DF,  p-value: 0.00383
anova(reduced_model, model)
Analysis of Variance Table

Model 1: demand ~ education + income
Model 2: demand ~ education + income + urbanization
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1      6 200                         
2      5 198  1      2.22 0.06   0.82
null_model <- lm(demand ~ 1, data=product)
anova(null_model, reduced_model)
Analysis of Variance Table

Model 1: demand ~ 1
Model 2: demand ~ education + income
  Res.Df  RSS Df Sum of Sq    F Pr(>F)   
1      8 1279                            
2      6  200  2      1079 16.2 0.0038 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extra sums of squares

Each regression model \(M\) leads to a particular decomposition of the SST: \[SST = SSR(M) + SSE(M)\]

For the null model we have SSE(\(\emptyset\)) = SST. The inclusion of terms will increase the SSR while decreasing the SSE.

To identify significant terms it may be useful to analyse their contributions to this variation transfer.

To do that we define partial or extra sums of squares as

\[SSR(L) - SSR(R) = SSE(R) - SSE(L)\]

The most useful extra sums of squares are the sequential ones:

\[SSR(x_{j+1}\mid x_1,\ldots,x_{j}) = SSR(x_1,\ldots,x_{j+1}) - SSR(x_1,\ldots,x_{j})\]

Another way of using the extra sums of squares is

\[SSR(x_1,\ldots,x_{p-1}) = SSR(x_1,\ldots,x_{j}) + SSR(x_{j+1}, \ldots, x_{p-1}\mid x_1,\ldots,x_{j})\]

This decomposition is not unique and can be applied sequentially. For example, in a model with 3 covariates we can have:

\(SSR(x_1,x_2,x_3)=SSR(x_1)+SSR(x_2\mid x_1)+SSR(x_3\mid x_1,x_2)\)

\(SSR(x_1,x_2,x_3)=SSR(x_2)+SSR(x_3\mid x_2)+SSR(x_1\mid x_2,x_3)\)

\(SSR(x_1,x_2,x_3)=SSR(x_1)+SSR(x_2, x_3\mid x_1)\)

The extra sums of squares are also useful to define the coefficients of partial determination.

For example,

\[R^2_{3\mid 1,2}=\frac{SSR(x_3\mid x_1, x_2)}{SSE(x_1, x_2)}\]

is the coefficient of partial determination between \(y\) and \(x_3\), given that \(x_1\) and \(x_2\) are in the model, and it is interpreted as the proportion of variation not explained by the reduced model that can be explained by the larger model.
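Using the full and reduced models fitted above, this coefficient is easy to obtain (a sketch; roughly \(2.22/200\approx 0.011\) from the ANOVA table above):

SSE_R <- sum(resid(reduced_model)^2)   # SSE(education, income)
SSE_L <- sum(resid(model)^2)           # SSE(education, income, urbanization)
(SSE_R - SSE_L) / SSE_R                # coefficient of partial determination for urbanization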

Finally, we can consider a sequence of partial F-tests.

anova(model)
Analysis of Variance Table

Response: demand
             Df Sum Sq Mean Sq F value Pr(>F)   
education     1    754     754   19.06 0.0072 **
income        1    325     325    8.21 0.0352 * 
urbanization  1      2       2    0.06 0.8221   
Residuals     5    198      40                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The statistic here is a variation of previous ones:

\[F^*=\frac{(n-p)}{c}\frac{SSR(x_{j+1}, \ldots, x_{j+c}\mid x_1,\ldots,x_{j})}{SSE(x_1,\ldots,x_{p-1})}\stackrel{H_0}{\sim}F_{(c, n-p)}\]

Another view of the previous ANOVA table with added information:

m0 <- lm(demand ~ 1, data= product)
m1 <- update(m0, . ~ . + education)
m2 <- update(m1, . ~ . + income)
m3 <- update(m2, . ~ . + urbanization)
anova(m0, m1, m2, m3)
Analysis of Variance Table

Model 1: demand ~ 1
Model 2: demand ~ education
Model 3: demand ~ education + income
Model 4: demand ~ education + income + urbanization
  Res.Df  RSS Df Sum of Sq     F Pr(>F)   
1      8 1279                             
2      7  525  1       754 19.06 0.0072 **
3      6  200  1       325  8.21 0.0352 * 
4      5  198  1         2  0.06 0.8221   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.4 Categorical covariates

To consider the inclusion of categorical covariates in a regression model it is particularly useful to look at the response surface:

\[E[Y\mid \b{x}] = \b{x}'\bs{\beta}\]

Binary covariates (2 levels)

Suppose that \(p=3\) and \(x_2\) is binary (0/1). The response surface for the first-order model is:

\[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+\beta_1 x_1, & x_2=1\end{cases} \end{align*}\]

\(\leadsto\) we are fitting 2 parallel response surfaces!

Note

To allow non-parallel surfaces we need to include interactions:

\[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+(\beta_1 + \beta_3) x_1, & x_2=1\end{cases} \end{align*}\]

\(E[y\mid x_1, x_2=1] - E[y\mid x_1, x_2=0] = \beta_{2} + \beta_3 x_1\)

More than 2 levels

Suppose now that \(x_2\) is a categorical variable or factor with 3 levels. If we code the levels as 0, 1 and 2 we have:

\[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_2)+\beta_1 x_1, & x_2=1\\[0.25cm](\beta_0 + 2\beta_2)+\beta_1 x_1, & x_2=2\end{cases} \end{align*}\]

Not only do we have parallel response surfaces, but they are also equally spaced and ordered. This is, in general, undesirable.

We could associate a binary variable with each level of the factor, \(x_2\rightsquigarrow (x_{2,0}, x_{2,1}, x_{2,2})\), but this would lead to a linear model whose design matrix is not of full rank.

Since \(x_2\) has 3 levels we will use only 2 indicator or dummy variables:

Level  \(x_{2,1}\)  \(x_{2,2}\)
  0        0            0
  1        1            0
  2        0            1

This is called reference-level encoding and the level for which all dummy variables equal 0 is called the reference level.

\[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_{2,1} x_{2,1} + \beta_{2,2} x_{2,2} =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_{2,1})+\beta_1 x_1, & x_2=1\\[0.25cm](\beta_0 + \beta_{2,2})+\beta_1 x_1, & x_2=2\end{cases} \end{align*}\]

What can we do with these dummies?

\(E[y\mid x_1, x_2=i] - E[y\mid x_1, x_2=0] = \beta_{2,i},\ i=1,2\)

Again, the inclusion of interactions will remove the parallelism.

\[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_{2,1} x_{2,1} + \beta_{2,2} x_{2,2} + \\[0.25cm] & + \beta_{3} x_1 x_{2,1} + \beta_4 x_1 x_{2,2} \\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_2=0\\[0.25cm](\beta_0 + \beta_{2,1})+(\beta_1 + \beta_3) x_1, & x_2=1\\[0.25cm](\beta_0 + \beta_{2,2}) + (\beta_1 + \beta_4) x_1, & x_2=2\end{cases} \end{align*}\]
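In R this coding is generated automatically when the covariate is declared as a factor; a small sketch using the built-in iris data (an illustration only, unrelated to the example above):

fit <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
head(model.matrix(fit), 3)   # intercept, x1, two dummies and two interaction columns
coef(fit)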

Notes

  1. Instead of the 0/1 encoding of the dummy variables, other values (contrasts) can also be used, providing different interpretations for the regression parameters.

  2. If all covariates are categorical, we will use Analysis of Variance (ANOVA) models.

Another application of categorical variables

Piecewise linear regression

Let \(x_1\) be a quantitative covariate and let \(x_2=I_{[c, +\infty[}(x_1)\).

  1. Consider

    \[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 (x_1-c)x_2 =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_1<c\\[0.25cm](\beta_0 - c \beta_2)+(\beta_1 + \beta_2) x_1, & x_1\geq c\end{cases} \end{align*}\]

    \(\leadsto\) continuous piecewise regression model (see the sketch after this list)

  2. Consider

    \[\begin{align*} E[Y\mid \b{x}] & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1-c)x_2 =\\[0.25cm] & = \begin{cases}\beta_0+\beta_1 x_1, & x_1<c\\[0.25cm](\beta_0 + \beta_2 - c \beta_3)+(\beta_1 + \beta_3) x_1, & x_1\geq c\end{cases} \end{align*}\]

    \(\leadsto\) discontinuous piecewise regression model
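A sketch of case 1, the continuous piecewise model, with an assumed knot at \(c=10\) and simulated data (all names are illustrative):

set.seed(3)
c0 <- 10
x1 <- runif(60, 5, 15)
y <- 2 + 0.5 * x1 + 1.5 * pmax(x1 - c0, 0) + rnorm(60, sd = 0.5)
fit <- lm(y ~ x1 + I(pmax(x1 - c0, 0)))    # (x1 - c) x2 equals pmax(x1 - c, 0)
coef(fit)                                  # estimates of beta_0, beta_1 and beta_2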
