I have a matched employer-employee data set with 20 years of information on around 2.000.000 individuals and 400.000 firms. Amongst several worker- and firm-level covariates, I have worker, firm and occupation identifiers. My objective is to obtain 3 vectors of worker, firm and occupation fixed effects, estimated through a wage equation. I want to implement the Gauss-Seidel algorithm to obtain the exact least squares solution.
In a nutshell, I need to start with the three fixed effects' variables equal to zero, followed by estimation of the wage equation and update of each of the three coefficients sequentially until convergence is met, based on some convergence criterion (in this case, I'm following Guimaraes & Portugal 2010 by using a comparison between the residual sum of squares). I tried to do it in stata but it was taking too long to converge, so I'm now trying to do it in R.
My R code:
estimate_fixed_effects = function(df, max_iter = 1000, tol = 1e-6) {
y = df$real_lrganho
X = model.matrix(~ p_age2 + antig + antig2 + firm_age + lfprod, df)
beta = rep(0, ncol(X))
worker_fe = rep(0, length(unique(df$ntrab)))
firm_fe = rep(0,length(unique(df$emp_id)))
occupation_fe = rep(0,length(unique(df$occupation4)))
for (iter in 1:max_iter){
for (i in 1:length(worker_fe)) {
y_i = y[df$ntrab == i]
X_i = X[df$ntrab == i,]
beta = lm(y_i - X_i %*% beta ~ 0)$coefficients
worker_fe[i] = mean(y_i - X_i %*% beta)
}
for (j in 1:length(firm_fe)) {
y_j = y[df$emp_id == j]
X_j = X[df$emp_id == j,]
beta = lm(y_j - X_j %*% beta ~ 0)$coefficients
firm_fe[j] = mean(y_j - X_j %*% beta)
}
for (k in 1:length(occupation_fe)) {
y_k = y[df$occupation4 == k]
X_k = X[df$occupation4 == k,]
beta = lm(y_k - X_k %*% beta ~ 0)$coefficients
occupation_fe[k] = mean(y_k - X_k %*% beta)
}
beta_new = lm(y - rowSums(model.matrix(~ ntrab + emp_id + occupation4, df) %*% c(worker_fe, firm_fe, occupation_fe)) ~ X - 1)$coefficients
if (max(abs(beta_new - beta))<tol) {
break
}
beta = beta_new
}
return(list(beta = beta, worker_fe = worker_fe, firm_fe = firm_fe, occupation_fe = occupation_fe))
}
The output should ideally be 3 vectors of estimated fixed effects for the 3 identified groups. However, after running result = estimate_fixed_effects(df), I got the error message "Error in X[df$ntrab == i, ] : (subscript) logical subscript too long". When trying to make sense of this output, I realize that I'm not entirely sure of how model.matrix() even works, and how to make the function do what I want. Any ideas?
Edit: I fixed the first error message, it had to do with a mismatch between the model matrix and df$ntrab, because of missing values on lfprod. Now I'm facing the following error which I don't understand: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'mean': non-conformable arguments. Any ideas?