perfect multicollinearity in glm


I would like to know how to solve a problem of perfect multicollinearity in a GLM that I fit in R. I want to see whether morphological measurements can predict a bird's arrival day on its territory. I have tarsus, wing, and tail measurements, and I also want to see whether males and females differ.

So, I'm using the code:

myggod <- glm(day_territory ~ sex * (Right_tarsus + Right_wing + 
                  Tail_length), data = territory, family = "poisson")

which gives the following output:

                         Estimate Std. Error z value Pr(>|z|)   
(Intercept)             19.626581  17.831173   1.101  0.27103   
sexfemale              -14.645707  17.852832  -0.820  0.41201   
sexmale                -12.343274  17.835662  -0.692  0.48890   
Right_tarsus            -0.920874   1.233841  -0.746  0.45546   
Right_wing              -0.007466   0.016571  -0.451  0.65233   
Tail_length             -0.043216   0.013195  -3.275  0.00106 **
sexfemale:Right_tarsus   0.883152   1.234115   0.716  0.47423   
sexmale:Right_tarsus     0.846497   1.233209   0.686  0.49245   
sexfemale:Right_wing     0.018863   0.020855   0.904  0.36574   
sexmale:Right_wing             NA         NA      NA       NA   
sexfemale:Tail_length    0.021428   0.015584   1.375  0.16911   
sexmale:Tail_length            NA         NA      NA       NA   

So I have perfect multicollinearity in the males' tail and wing terms.

I have already tried scaling and centering (scale with center = TRUE), subtracting the mean from each measure, log-transforming the measures, and replacing wing and tail with PC1 from a PCA on those two measures.

Nothing worked: I get the same issue with all of these methods. Even when both measures are replaced by PC1 alone, the same NAs appear.

So, how can I solve it?


There is 1 best solution below

Len Greski

We can eliminate the overparameterization problem by removing the interaction effects from the model.

if(!dir.exists("./data")) dir.create("./data")
download.file("https://drive.google.com/uc?export=download&id=1OMaVfeUipRsa1njydYTAgls9pPCVFCdD",
              "./data/bird_stats.csv",mode="w")

df <- read.csv("./data/bird_stats.csv",sep = ";")

aModel <- glm(day_territory ~ sex + Right_tarsus + Right_wing + Tail_length,
             data = df, family = "poisson")
summary(aModel)

...and the output:

Call:
glm(formula = day_territory ~ sex + Right_tarsus + Right_wing + 
    Tail_length, family = "poisson", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4278  -1.5146  -0.4210   0.9837   6.6771  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   6.010191   0.670813   8.960  < 2e-16 ***
sexfemale     0.163442   0.098671   1.656  0.09763 .  
sexmale      -0.167495   0.102225  -1.638  0.10132    
Right_tarsus -0.056499   0.019188  -2.944  0.00323 ** 
Right_wing    0.002091   0.009921   0.211  0.83311    
Tail_length  -0.030275   0.006944  -4.360  1.3e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 712.81  on 101  degrees of freedom
Residual deviance: 493.95  on  96  degrees of freedom
  (83 observations deleted due to missingness)
AIC: 1085.5

Number of Fisher Scoring iterations: 4

The AIC on the overparameterized model is 1087.8, so the model with fewer parameters is slightly better than the overparameterized one.
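The comparison can be reproduced in miniature. The sketch below uses simulated data, not the bird data: the data frame `sim`, its columns, and the duplicated predictor `wing_copy` are all hypothetical, and the duplicate is there only to force an aliased term. It shows one way to list the coefficients that glm() could not estimate and to put the AIC of both fits side by side.

```r
# Simulated stand-in for the bird data; all names and values here are hypothetical.
set.seed(42)
n <- 120
sim <- data.frame(
  sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  wing = rnorm(n, mean = 70, sd = 3),
  tail = rnorm(n, mean = 55, sd = 4)
)
# An exact copy of 'wing' forces perfect collinearity, so glm() reports NA for it.
sim$wing_copy <- sim$wing
sim$day <- rpois(n, lambda = exp(3 - 0.03 * (sim$tail - 55)))

full    <- glm(day ~ sex * (wing + wing_copy + tail), data = sim, family = poisson)
reduced <- glm(day ~ sex + wing + tail, data = sim, family = poisson)

# Coefficients glm() could not estimate (aliased columns of the design matrix).
aliased <- names(coef(full))[is.na(coef(full))]
print(aliased)

# AIC of both fits side by side; lower is better.
aic_table <- AIC(full, reduced)
print(aic_table)
```

An NA coefficient means that column of the design matrix is a linear combination of earlier columns in the rows actually used for the fit, which is why rescaling or centering a predictor does not make it go away.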

Note that almost half the observations in the data frame (83 of 185) were deleted from the analysis due to missing values. You'll need to review the missing data and decide on a strategy for handling it, such as imputation, or collect more data to assess whether the sex variable is meaningful.
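A first pass at reviewing the missing data might look like the sketch below. The data frame `birds` is a made-up miniature that only borrows the column names from the question; the blank sex value is an assumption, prompted by the fact that the model output lists both sexfemale and sexmale, which suggests a third level is acting as the baseline.

```r
# Hypothetical miniature of the data; only the column names come from the question.
birds <- data.frame(
  sex           = c("female", "male", "", "male", "female", "male", "female", "female"),
  Right_tarsus  = c(19.1, NA, 18.7, 19.4, NA, 19.0, 18.9, 19.3),
  Right_wing    = c(70.2, 71.5, NA, 72.0, 69.8, 71.1, 70.5, 70.0),
  Tail_length   = c(55.0, 56.1, 54.2, 57.0, 54.9, 57.3, NA, 55.4),
  day_territory = c(12, 9, 15, 10, 14, 8, 13, NA)
)

# Missing values per column (a blank string is not NA, so 'sex' shows 0 here).
na_per_column <- colSums(is.na(birds))
print(na_per_column)

# Rows glm() would keep: those with no NA in any model variable.
complete <- complete.cases(birds)
print(sum(complete))

# Usable rows per sex level; a level with very few rows cannot support its own slopes.
print(table(birds$sex[complete]))
```

If one sex level ends up with only a handful of complete rows, its interaction slopes cannot all be estimated, which would be consistent with the NA rows in the original output.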

Also, the dependent variable in a Poisson model is typically a count, but from the original question it's hard to understand why a Poisson model is being used here. That is, if the variables Right_tarsus, Right_wing, and Tail_length are size measurements of birds, why would size measurements predict counts?

If the dependent variable is the day of arrival in a specific location, a Poisson model probably isn't the right model.
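One rough check on whether the Poisson assumption holds is the residual deviance per degree of freedom. In the output above it is 493.95 / 96, roughly 5.1, where a well-fitting Poisson model would give a value near 1. The sketch below runs the same check on simulated overdispersed counts; the data and the variables `x` and `y` are hypothetical, and quasipoisson is shown only as one possible alternative, not as the right model for arrival days.

```r
# Simulated overdispersed counts; the data and variable names are hypothetical.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rnbinom(n, mu = exp(2 + 0.3 * x), size = 1.5)

pois_fit <- glm(y ~ x, family = poisson)

# Residual deviance per degree of freedom; values well above 1 hint at overdispersion.
ratio <- deviance(pois_fit) / df.residual(pois_fit)
print(ratio)

# Same mean model with an estimated dispersion parameter, which widens the standard errors.
quasi_fit <- glm(y ~ x, family = quasipoisson)
print(summary(quasi_fit)$dispersion)
```

The quasipoisson fit returns the same coefficient estimates as the Poisson fit; only the standard errors, and therefore the p-values, change.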