Lab Class 6: Classic methods of estimation and algorithms


Example 1 - Multiple linear regression

Data set obtained from Ross (2009).

A study attempted to relate job satisfaction to yearly income (in $1,000s) and seniority for a random sample of 9 workers. The job satisfaction value given for each worker is a self-assessment on a scale from 1 (lowest) to 10 (highest).

Yearly Income   Years on the Job   Job Satisfaction
      27               8                 5.6
      22               4                 6.3
      34              12                 6.8
      28               9                 6.7
      36              16                 7.0
      39              14                 7.7
      33              10                 7.0
      42              15                 8.0
      46              22                 7.8

What qualitative conclusions can you draw about how job satisfaction changes when income remains fixed and the number of years of service increases?

Predict the job satisfaction of an employee who has spent 5 years on the job and earns a yearly salary of $31,000.

Data are available from https://web.tecnico.ulisboa.pt/giovani.silva/ec/LabClasses_data2.txt

  1. Read the data and show the first rows.
setwd("C:/Privado/Disciplinas/EC/Exercicios") # set your working directory
#setwd("~/Privado/Disciplinas/EC/Exercicios") # set your working directory

mydata2 <- read.table("LabClasses_data2.txt", header=TRUE, sep="")

dim(mydata2)
## [1] 9 3
head(mydata2)
##   Income Years Satisfaction
## 1     27     8          5.6
## 2     22     4          6.3
## 3     34    12          6.8
## 4     28     9          6.7
## 5     36    16          7.0
## 6     39    14          7.7
  2. Get some descriptive statistics, including plots of the response variable versus each covariate.
summary(mydata2)
##      Income          Years        Satisfaction  
##  Min.   :22.00   Min.   : 4.00   Min.   :5.600  
##  1st Qu.:28.00   1st Qu.: 9.00   1st Qu.:6.700  
##  Median :34.00   Median :12.00   Median :7.000  
##  Mean   :34.11   Mean   :12.22   Mean   :6.989  
##  3rd Qu.:39.00   3rd Qu.:15.00   3rd Qu.:7.700  
##  Max.   :46.00   Max.   :22.00   Max.   :8.000
plot(mydata2$Income,mydata2$Satisfaction,xlab="Yearly Income",ylab="Job Satisfaction")

plot(mydata2$Years, mydata2$Satisfaction,xlab="Years on the Job",ylab="Job Satisfaction") 
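Beyond the scatter plots, a correlation matrix gives a quick numerical summary of the pairwise associations (an extra check, not required by the exercise; the data frame is recreated inline here so the snippet runs on its own):

```r
# Recreate the data from the table above (equivalent to reading
# LabClasses_data2.txt with read.table)
mydata2 <- data.frame(
  Income       = c(27, 22, 34, 28, 36, 39, 33, 42, 46),
  Years        = c(8, 4, 12, 9, 16, 14, 10, 15, 22),
  Satisfaction = c(5.6, 6.3, 6.8, 6.7, 7.0, 7.7, 7.0, 8.0, 7.8)
)
# Satisfaction is positively correlated with both covariates, and Income
# and Years are themselves highly correlated (near-collinearity), which
# is relevant when interpreting the multiple regression coefficients
round(cor(mydata2), 3)
```

The strong Income–Years correlation helps explain why the sign and significance of Years can differ between the simple and multiple regressions below.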

  3. Fit a multiple linear regression model.
fit1 <- lm(Satisfaction ~ Income + Years, data=mydata2)
summary(fit1)
## 
## Call:
## lm(formula = Satisfaction ~ Income + Years, data = mydata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71365 -0.05968  0.04694  0.13145  0.34478 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  2.84369    1.00508   2.829    0.030 *
## Income       0.16195    0.05512   2.938    0.026 *
## Years       -0.11283    0.08001  -1.410    0.208  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3692 on 6 degrees of freedom
## Multiple R-squared:  0.8263, Adjusted R-squared:  0.7684 
## F-statistic: 14.27 on 2 and 6 DF,  p-value: 0.005239
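The least-squares estimates reported above can be reproduced directly from the normal equations \(\widehat{\boldsymbol\beta}=(X^{\top}X)^{-1}X^{\top}y\) — a sketch of the classic estimation method behind lm(), with the data entered inline so the snippet is self-contained:

```r
# Design matrix with an intercept column, and the response, from the table above
X <- cbind(Intercept = 1,
           Income = c(27, 22, 34, 28, 36, 39, 33, 42, 46),
           Years  = c(8, 4, 12, 9, 16, 14, 10, 15, 22))
y <- c(5.6, 6.3, 6.8, 6.7, 7.0, 7.7, 7.0, 8.0, 7.8)
# Solve the normal equations (X'X) beta = X'y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
round(drop(beta.hat), 5)   # same estimates as in summary(fit1)
```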
  4. Predict Satisfaction for Income = 31 and Years = 5.
newdata2 <- data.frame(Income = 31, Years = 5)
predict(fit1, newdata2, interval = "prediction")
##        fit      lwr      upr
## 1 7.299922 5.902068 8.697775
  5. Compare the multiple regression model with a simple linear regression involving only Income.
fit2 <- lm(Satisfaction ~ Income, data=mydata2)
summary(fit2)
## 
## Call:
## lm(formula = Satisfaction ~ Income, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7627 -0.1791  0.1090  0.2806  0.3775 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.98529    0.63637   6.263 0.000419 ***
## Income       0.08805    0.01825   4.824 0.001913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3944 on 7 degrees of freedom
## Multiple R-squared:  0.7688, Adjusted R-squared:  0.7357 
## F-statistic: 23.27 on 1 and 7 DF,  p-value: 0.001913
anova(fit1, fit2) 
## Analysis of Variance Table
## 
## Model 1: Satisfaction ~ Income + Years
## Model 2: Satisfaction ~ Income
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1      6 0.81785                           
## 2      7 1.08892 -1  -0.27107 1.9887 0.2082
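The partial F statistic in this table can be checked by hand from the two residual sums of squares, \(F = \frac{(\mathrm{RSS}_{\text{reduced}}-\mathrm{RSS}_{\text{full}})/1}{\mathrm{RSS}_{\text{full}}/6}\), using the values printed above:

```r
rss.full <- 0.81785; df.full <- 6      # Satisfaction ~ Income + Years
rss.red  <- 1.08892; df.red  <- 7      # Satisfaction ~ Income
# F statistic for dropping Years, and its p-value from the F(1, 6) distribution
F.stat <- ((rss.red - rss.full) / (df.red - df.full)) / (rss.full / df.full)
p.val  <- pf(F.stat, df.red - df.full, df.full, lower.tail = FALSE)
round(c(F = F.stat, p = p.val), 4)     # agrees with the anova() output
```

Since only one term is dropped, this F equals the square of the t statistic for Years in summary(fit1): \((-1.410)^2 \approx 1.99\).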

Note: There are many other regression issues (e.g. regression diagnostics) that we will review later.


Example 2 - Maximum likelihood method

Consider a random vector \({\bf X}=(X_1,\ldots,X_k) \sim \text{Multinomial}(N,{\bf p}=(p_1,\ldots,p_k))\) such that \(\sum_{j=1}^k p_j=1\) and \(\sum_{j=1}^k X_j=N\). Find the maximum likelihood (ML) estimator of \({\bf p}\). Based on a sample with \(N\!=\!197\), \(x_1\!=\!125\), \(x_2\!=\!18\), \(x_3\!=\!20\), \(x_4\!=\!34\), what is its estimate?

By using the method of Lagrange multipliers, find the ML estimate of \({\bf p}\) under the restriction \(\sum_{j=1}^k p_j=1\).

  • The ML estimate of \({\bf p}\) is obtained by maximizing the Lagrangian function:

\[ \mathcal{L}({\bf p},\lambda) = \sum_{j=1}^k x_j \log (p_j) - \lambda \biggl( \sum_{j=1}^k p_j-1 \biggr) , \] that is, \[ \begin{array}{l} \frac{\partial}{\partial p_j} \mathcal{L}({\bf p},\lambda) = \frac{x_j}{p_j} - \lambda = 0 \ \Leftrightarrow \ p_j = \frac{x_j}{\lambda}, \ j=1,\ldots,k \\ \frac{\partial}{\partial \lambda} \mathcal{L}({\bf p},\lambda) = \sum_{j=1}^k p_j - 1 = 0 \ \Leftrightarrow \ \sum_{j=1}^k p_j = 1 \\ \therefore \ \sum_{j=1}^k \frac{x_j}{\lambda} = 1 \ \Rightarrow \ \lambda = \sum_{j=1}^k x_j = N \quad \text{and} \quad \widehat{p}_j = \frac{x_j}{N}. \end{array} \]

  • Since the corresponding Hessian matrix, \(\mathrm{diag}(-x_j/p_j^2)\), is negative definite (for \(x_j>0\)), \(\widehat{\bf p} = \frac{1}{N} {\bf X}\) is the ML estimator and its estimate is
c(125,18,20,34)/197
## [1] 0.63451777 0.09137056 0.10152284 0.17258883
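As a quick numerical check (not part of the exercise), the closed-form estimate can be compared against any other probability vector via the multinomial log-likelihood computed by dmultinom():

```r
x <- c(125, 18, 20, 34)
N <- sum(x)                      # 197
p.hat <- x / N                   # closed-form ML estimate derived above
# Multinomial log-likelihood as a function of the probability vector
ll <- function(p) dmultinom(x, prob = p, log = TRUE)
# The log-likelihood at p.hat exceeds that at, e.g., the uniform vector
ll(p.hat) > ll(rep(1/4, 4))      # TRUE
```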


Example 3 - Newton-Raphson method

In the scenario of Example 2, suppose that the probabilities are related by a single parameter \(\theta\), \(p_1=\frac{1}{2} \!+\! \frac{1}{4}\theta\), \(p_2=p_3=\frac{1}{4} \!-\! \frac{1}{4}\theta\), \(p_4=\frac{1}{4}\theta\), \(0\leq \theta \leq 1\) (Gentle, 2002).

Write a program to find the ML estimate of \(\theta\) by the Fisher scoring method (Newton-Raphson with the expected information), starting with \(\theta^{(0)}=0.5\).

Hint: See also Example 2.3 from Silva (2023).

# Log-likelihood of theta (up to an additive constant) for the observed counts
f <- function(x) {
  125*log((2+x)/4) + (18+20)*log((1-x)/4) + 34*log(x/4) 
}

# Score function: first derivative of the log-likelihood
f.prime <- function(x) {
  125/(2+x) - (18+20)/(1-x) + 34/x 
}

# Negative expected Fisher information, used in place of the observed
# second derivative (this is what makes the iteration Fisher scoring)
f.double.prime <- function(x) {
  -(197/4)*((1/(2+x)) + (2/(1-x)) + (1/x))
}

guess <- 0.5
tolerance <- .00001

root <- function(f.prime, f.double.prime, guess, tolerance) {
  x <- guess
  # Iterate x <- x - f'(x)/f''(x) until the score is close enough to zero
  while (abs(f.prime(x)) > tolerance) {
    x <- x - f.prime(x)/f.double.prime(x)
    print(paste(x, "has been processed"))
  }
  x
}

root(f.prime, f.double.prime, guess, tolerance)
## [1] "0.633248730964467 has been processed"
## [1] "0.626534069269729 has been processed"
## [1] "0.626834428415782 has been processed"
## [1] "0.626820916319874 has been processed"
## [1] "0.626821524026551 has been processed"
## [1] 0.6268215
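The root found by Fisher scoring can be cross-checked with a generic one-dimensional optimizer applied to the log-likelihood (a sanity check, with f redefined inline so the snippet is self-contained):

```r
# Log-likelihood (up to an additive constant), as in Example 3
f <- function(x) 125*log((2+x)/4) + 38*log((1-x)/4) + 34*log(x/4)
# Maximize over (0, 1); a tight tolerance makes the comparison meaningful
opt <- optimize(f, interval = c(0, 1), maximum = TRUE, tol = 1e-8)
round(opt$maximum, 5)            # close to the Fisher-scoring root 0.6268215
```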


References:

Gentle, J.E. (2002). Elements of Computational Statistics. Springer, New York.

Rizzo, M.L. (2019). Statistical Computing with R, 2nd edition. Chapman and Hall/CRC.

Ross, S.M. (2009). Introduction to Probability and Statistics for Engineers and Scientists, 4th edition. Elsevier/Academic Press.

Silva, G.L. (2023). Lecture Notes on Computational Statistics, with Exercises. IST, Lisbon.