lm with interactions between dummy variables


I have been using the Prestige dataset from the mdhglm package. I wanted to understand how my model would change if I included interactions with the dummies derived from the predictor 'type'. Everything works except for the 'blue collars' dummy: the output shows NA for the dummy itself and, as a consequence, for its interactions with the other predictors. But my data contain no NA values and the dummy itself looks fine, so I don't understand what is happening. Can you please help me?

Prestige3$professional <- ifelse(Prestige3$type == "prof", 1, 0)
Prestige3$white_collars <- ifelse(Prestige3$type == "wc", 1, 0)
Prestige3$blue_collars <- ifelse(Prestige3$type == "bc", 1, 0)

modello_interazioni <- lm(prestige ~ women * professional + education * professional +
                            income_log * professional + women * white_collars +
                            education * white_collars + income_log * white_collars +
                            women * blue_collars + education * blue_collars +
                            income_log * blue_collars,
                          data = Prestige3)

summary(modello_interazioni)

I tried recreating the dummies because I thought that might be the problem, but they work fine. I also checked again for NA values, and there are none.

2 Answers

Answer from Richard Summers

In the output table, you may notice that it says "Coefficients: (4 not defined because of singularities)" just above the table of coefficients.

This can happen for a number of reasons. Usually it is collinearity, which leaves the model rank-deficient so some coefficients cannot be estimated. In this case you don't need the dummy variables at all: you can keep 'type' as a categorical variable directly in the formula, using C() to treat it as categorical.

model <- lm(prestige ~ women * C(type) +
              education * C(type) +
              income * C(type),
            data = Prestige)
summary(model)

This gives the following coefficient table:

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -5.822e+00  7.311e+00  -0.796  0.42803    
women                  1.343e-01  4.656e-02   2.885  0.00494 ** 
C(type)prof            2.436e+01  1.351e+01   1.803  0.07496 .  
C(type)wc             -2.178e+01  1.727e+01  -1.261  0.21081    
education              1.625e+00  9.163e-01   1.773  0.07971 .  
income                 4.692e-03  6.691e-04   7.013 5.00e-10 ***
women:C(type)prof     -1.601e-01  6.506e-02  -2.460  0.01588 *  
women:C(type)wc        2.893e-02  1.117e-01   0.259  0.79619    
C(type)prof:education  1.512e+00  1.235e+00   1.224  0.22423    
C(type)wc:education    2.123e+00  2.190e+00   0.970  0.33491    
C(type)prof:income    -4.144e-03  7.132e-04  -5.810 1.03e-07 ***
C(type)wc:income      -7.527e-04  1.814e-03  -0.415  0.67924 

Hope that answers your question. ~R

Answer from Onyambu

There are two situations where the coefficients will be NA.

  • When you have more predictors than observations, i.e. you cannot estimate all the coefficients. In this situation the standard errors, t-statistics and p-values will all be NA as well, and you can use half-normal plots to judge the effects (see the sketch after this list).

  • When there is complete aliasing (perfect collinearity).
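To make the first situation concrete, here is a minimal sketch with made-up data (not from the question): three observations can support at most three coefficients, so the fourth one comes back as NA.

## Sketch of case 1: four coefficients (intercept + x1 + x2 + x3), only three observations.
set.seed(1)
d <- data.frame(y = rnorm(3), x1 = rnorm(3), x2 = rnorm(3), x3 = rnorm(3))
coef(lm(y ~ x1 + x2 + x3, data = d))
## x3 is returned as NA: there are no degrees of freedom left to estimate it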

In your case you are in the second situation: either two columns are exactly the same, or one column is a perfect, noise-free linear combination of the others. Use the function alias() to find out which columns are aliased:

alias(modello_interazioni)

Notice from the above that the row variables (the ones reported as NA) are exact linear combinations of the column variables with non-zero entries, e.g. blue_collars = (Intercept) - professional - white_collars. Because of this perfectly linear relationship, one of them must be NA.
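As a quick check (a sketch, assuming 'type' has no missing values, as the question states), you can verify this dependence directly: the three dummies sum to 1 in every row, i.e. they reproduce the intercept column, so one of them is redundant.

## The three dummies always add up to the intercept column, so once the intercept,
## professional and white_collars are in the model, blue_collars adds nothing new.
with(Prestige3, all(professional + white_collars + blue_collars == 1))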


As a point to note, you should consider running your model as:

summary(lm(prestige~(women + education + income_log) * type, Prestige3))

                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -172.83613   26.17288  -6.604 3.17e-09 ***
women                  0.14059    0.04758   2.955 0.004033 ** 
education              2.42215    0.88082   2.750 0.007266 ** 
income_log            21.78584    3.15780   6.899 8.38e-10 ***
typeprof             147.25606   38.83048   3.792 0.000277 ***
typewc               -24.50672   68.08447  -0.360 0.719770    
women:typeprof        -0.16678    0.06888  -2.421 0.017561 *  
women:typewc           0.05693    0.11155   0.510 0.611098    
education:typeprof     0.68858    1.23286   0.559 0.577937    
education:typewc       0.83715    2.17074   0.386 0.700706    
income_log:typeprof  -16.29484    4.55783  -3.575 0.000577 ***
income_log:typewc      1.06471    8.95592   0.119 0.905645  

which gives the result you want. There is no need to create the dummy variables manually unless you are implementing linear regression from scratch. For comparison, here is the summary of your original model with the manual dummies, showing the coefficients that could not be estimated:

summary(modello_interazioni)
Coefficients: (4 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -172.83613   26.17288  -6.604 3.17e-09 ***
women                       0.14059    0.04758   2.955 0.004033 ** 
professional              147.25606   38.83048   3.792 0.000277 ***
education                   2.42215    0.88082   2.750 0.007266 ** 
income_log                 21.78584    3.15780   6.899 8.38e-10 ***
white_collars             -24.50672   68.08447  -0.360 0.719770    
blue_collars                     NA         NA      NA       NA    
women:professional         -0.16678    0.06888  -2.421 0.017561 *  
professional:education      0.68858    1.23286   0.559 0.577937    
professional:income_log   -16.29484    4.55783  -3.575 0.000577 ***
women:white_collars         0.05693    0.11155   0.510 0.611098    
education:white_collars     0.83715    2.17074   0.386 0.700706    
income_log:white_collars    1.06471    8.95592   0.119 0.905645    
women:blue_collars               NA         NA      NA       NA    
education:blue_collars           NA         NA      NA       NA    
income_log:blue_collars          NA         NA      NA       NA
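
Finally, if you are curious how R expands the factor itself, here is a small sketch (assuming Prestige3$type is a factor with levels bc, prof and wc, as in the question): model.matrix() shows the design matrix that lm() builds from the formula.

## Design matrix for the compact formula: the factor 'type' is expanded into dummy
## columns automatically, with the first level dropped as the reference category.
head(model.matrix(~ (women + education + income_log) * type, data = Prestige3))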