Data set obtained from Ross (2009).
A study attempted to relate job satisfaction to yearly income (in $1,000s) and seniority for a random sample of 9 workers. The job satisfaction value given for each worker is his/her own assessment, on a scale from 1 (lowest) to 10 (highest).
| Yearly Income ($1,000s) | Years on the Job | Job Satisfaction (1–10) |
|---|---|---|
| 27 | 8 | 5.6 |
| 22 | 4 | 6.3 |
| 34 | 12 | 6.8 |
| 28 | 9 | 6.7 |
| 36 | 16 | 7.0 |
| 39 | 14 | 7.7 |
| 33 | 10 | 7.0 |
| 42 | 15 | 8.0 |
| 46 | 22 | 7.8 |
What qualitative conclusions can you draw about how job satisfaction changes when income remains fixed and the number of years of service increases?
Predict the job satisfaction of an employee who has spent 5 years on the job and earns a yearly salary of $31,000.
Data are available from https://web.tecnico.ulisboa.pt/giovani.silva/ec/LabClasses_data2.txt
setwd("C:/Privado/Disciplinas/EC/Exercicios") # set your working directory
#setwd("~/Privado/Disciplinas/EC/Exercicios") # set your working directory
mydata2 <- read.table("LabClasses_data2.txt", header=TRUE, sep="")
dim(mydata2)
## [1] 9 3
head(mydata2)
## Income Years Satisfaction
## 1 27 8 5.6
## 2 22 4 6.3
## 3 34 12 6.8
## 4 28 9 6.7
## 5 36 16 7.0
## 6 39 14 7.7
summary(mydata2)
## Income Years Satisfaction
## Min. :22.00 Min. : 4.00 Min. :5.600
## 1st Qu.:28.00 1st Qu.: 9.00 1st Qu.:6.700
## Median :34.00 Median :12.00 Median :7.000
## Mean :34.11 Mean :12.22 Mean :6.989
## 3rd Qu.:39.00 3rd Qu.:15.00 3rd Qu.:7.700
## Max. :46.00 Max. :22.00 Max. :8.000
plot(mydata2$Income,mydata2$Satisfaction,xlab="Yearly Income",ylab="Job Satisfaction")
plot(mydata2$Years, mydata2$Satisfaction,xlab="Years on the Job",ylab="Job Satisfaction")
fit1 <- lm(Satisfaction ~ Income + Years, data=mydata2)
summary(fit1)
##
## Call:
## lm(formula = Satisfaction ~ Income + Years, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71365 -0.05968 0.04694 0.13145 0.34478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.84369 1.00508 2.829 0.030 *
## Income 0.16195 0.05512 2.938 0.026 *
## Years -0.11283 0.08001 -1.410 0.208
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3692 on 6 degrees of freedom
## Multiple R-squared: 0.8263, Adjusted R-squared: 0.7684
## F-statistic: 14.27 on 2 and 6 DF, p-value: 0.005239
newdata2 <- data.frame(Income=31, Years=5)
predict(fit1, newdata2, interval="prediction") # 95% prediction interval
## fit lwr upr
## 1 7.299922 5.902068 8.697775
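As a quick sanity check, the point prediction can be reproduced directly from the fitted coefficients. The sketch below rebuilds the data frame inline from the table above, so it runs standalone.

```r
# Rebuild the data from the table above (so this chunk is self-contained)
mydata2 <- data.frame(
  Income       = c(27, 22, 34, 28, 36, 39, 33, 42, 46),
  Years        = c(8, 4, 12, 9, 16, 14, 10, 15, 22),
  Satisfaction = c(5.6, 6.3, 6.8, 6.7, 7.0, 7.7, 7.0, 8.0, 7.8)
)
fit1 <- lm(Satisfaction ~ Income + Years, data = mydata2)
b <- coef(fit1)
# Point prediction by hand: b0 + b1*Income + b2*Years
unname(b[1] + b[2]*31 + b[3]*5)  # approx 7.2999, matching predict()
```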
fit2 <- lm(Satisfaction ~ Income, data=mydata2)
summary(fit2)
##
## Call:
## lm(formula = Satisfaction ~ Income, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7627 -0.1791 0.1090 0.2806 0.3775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98529 0.63637 6.263 0.000419 ***
## Income 0.08805 0.01825 4.824 0.001913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3944 on 7 degrees of freedom
## Multiple R-squared: 0.7688, Adjusted R-squared: 0.7357
## F-statistic: 23.27 on 1 and 7 DF, p-value: 0.001913
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: Satisfaction ~ Income + Years
## Model 2: Satisfaction ~ Income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 0.81785
## 2 7 1.08892 -1 -0.27107 1.9887 0.2082
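The F statistic in this table can be reproduced by hand from the two residual sums of squares; the sketch below simply plugs in the RSS values printed above.

```r
# Partial F-test for dropping Years, from the RSS values in the ANOVA table
rss_full <- 0.81785; df_full <- 6   # Satisfaction ~ Income + Years
rss_red  <- 1.08892; df_red  <- 7   # Satisfaction ~ Income
F_stat <- ((rss_red - rss_full)/(df_red - df_full)) / (rss_full/df_full)
F_stat        # approx 1.989, as reported by anova()
sqrt(F_stat)  # approx 1.410, the |t| statistic for Years in summary(fit1)
```

Since only one term is dropped, this F equals the square of the t statistic for Years in `summary(fit1)`, and the p-values coincide (0.208).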
Note: there are many other regression issues (e.g. regression diagnostics) that we will revisit later.
Consider a random vector \({\bf X}=(X_1,\ldots,X_k) \sim \text{Multinomial}(N,{\bf p})\), where \({\bf p}=(p_1,\ldots,p_k)\) satisfies \(\sum_{j=1}^k p_j=1\) and \(\sum_{j=1}^k X_j=N\). Find the maximum likelihood (ML) estimator of \({\bf p}\). Based on a sample with \(N\!=\!197\), \(x_1\!=\!125\), \(x_2\!=\!18\), \(x_3\!=\!20\), \(x_4\!=\!34\), what is its estimate?
By using the method of Lagrange multipliers, find the ML estimate of \({\bf p}\) under the restriction \(\sum_{j=1}^k p_j=1\).
The Lagrangian of the multinomial log-likelihood (up to an additive constant), with multiplier \(\lambda\) for the constraint, is \[ SS({\bf p},\lambda) = \sum_{j=1}^k x_j \log (p_j) - \lambda \biggl( \sum_{j=1}^k p_j-1 \biggr) , \] whose stationarity conditions give \[ \begin{array}{l} \frac{\partial}{\partial p_j} SS({\bf p},\lambda) = \frac{x_j}{p_j} - \lambda = 0 \ \Leftrightarrow \ p_j = \frac{x_j}{\lambda}, \ j=1,\ldots,k \\ \frac{\partial}{\partial \lambda} SS({\bf p},\lambda) = -\biggl(\sum_{j=1}^k p_j - 1\biggr) = 0 \ \Leftrightarrow \ \sum_{j=1}^k p_j = 1 \\ \therefore \ \sum_{j=1}^k \frac{x_j}{\lambda} = 1 \ \Rightarrow \ \lambda=\sum_{j=1}^k x_j = N \quad \text{and} \quad \hat{p}_j = \frac{x_j}{N}. \end{array} \]
c(125,18,20,34)/197 # ML estimates p_j = x_j/N
## [1] 0.63451777 0.09137056 0.10152284 0.17258883
In the scenario of Exercise 2, suppose that the probabilities are related by a single parameter \(\theta\), \(p_1=\frac{1}{2} \!+\! \frac{1}{4}\theta\), \(p_2=p_3=\frac{1}{4} \!-\! \frac{1}{4}\theta\), \(p_4=\frac{1}{4}\theta\), \(0\leq \theta \leq 1\) (Gentle, 2002).
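Before estimating \(\theta\), it is easy to confirm that these four probabilities do form a valid distribution for any \(\theta \in [0,1]\). A minimal standalone check (the function name `p` is ours):

```r
# Probabilities p_1..p_4 as functions of theta (parameterisation of Gentle, 2002)
p <- function(theta) c(1/2 + theta/4, 1/4 - theta/4, 1/4 - theta/4, theta/4)
sum(p(0.3))  # 1, and the theta terms cancel for any theta in [0, 1]
```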
Write a program that computes the ML estimate of \(\theta\) by the Fisher scoring method, starting at \(\theta^{(0)}=0.5\).
Hint: See also Example 2.3 from Silva (2023).
# Log-likelihood of theta (up to an additive constant)
f <- function(x) {
125*log((2+x)/4) + (18+20)*log((1-x)/4) + 34*log(x/4)
}
# Score function: first derivative of the log-likelihood
f.prime <- function(x) {
125/(2+x) - (18+20)/(1-x) + 34/x
}
# Expected second derivative, i.e. minus the Fisher information (scoring step)
f.double.prime <- function(x) {
-(197/4)*((1/(2+x)) + (2/(1-x)) + (1/x))
}
guess <- 0.5
tolerance <- 1e-5
# Fisher scoring: iterate x <- x - f'(x)/E[f''(x)] until the score is near zero
root <- function(f.prime, f.double.prime, guess, tolerance) {
x <- guess
while (abs(f.prime(x)) > tolerance) {
x <- x - f.prime(x)/f.double.prime(x)
print(paste(x, "has been processed"))
}
x
}
root(f.prime, f.double.prime, guess, tolerance)
## [1] "0.633248730964467 has been processed"
## [1] "0.626534069269729 has been processed"
## [1] "0.626834428415782 has been processed"
## [1] "0.626820916319874 has been processed"
## [1] "0.626821524026551 has been processed"
## [1] 0.6268215
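As a cross-check (not part of the original exercise), the same maximiser can be found with R's generic one-dimensional optimiser applied to the log-likelihood. The sketch below is self-contained.

```r
# Maximise the log-likelihood (up to a constant) over theta in (0, 1)
loglik <- function(x) 125*log((2+x)/4) + (18+20)*log((1-x)/4) + 34*log(x/4)
opt <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
opt$maximum  # approx 0.6268, agreeing with the Fisher scoring iterations
```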
Gentle, J.E. (2002). Elements of Computational Statistics. Springer, New York.
Rizzo, M.L. (2019). Statistical Computing with R, 2nd edition. Chapman and Hall/CRC.
Ross, S.M. (2009). Introduction to Probability and Statistics for Engineers and Scientists, 4th edition. Elsevier/Academic Press.
Silva, G.L. (2023). Lecture Notes on Computational Statistics, with Exercises. IST, Lisbon.