Dropping perfect prediction cases (groups based on a categorical variable) before running probit


I have several categorical predictors, which I enter using factor-variable notation (for example, i.year). A few years, each with many observations, predict the outcome perfectly. Stata works through these cases and drops them, but very slowly. Given this setup, I have two questions:

  1. Can I identify these perfect-prediction groups myself with some simple code and exclude those observations (through an if condition on a flag) when running probit? Is this justifiable, and will my estimates be the same? (Because Stata drops these cases so slowly, I still don't have the estimates from letting probit drop them itself.)

  2. I need to predict the inverse Mills ratio (IMR) from this probit for my further analysis. If I take the manual approach above and then run predict IMR, score on ALL the data, will this give the correct IMR for the excluded observations, i.e. those in the perfectly predicted groups?


Kathryn Roman Banda

I have a few comments for you.

(1) First, the assumed distribution (logit versus probit), sometimes called the 'link function', shouldn't determine the association between a categorical independent variable and your dependent variable. Whether a year predicts the outcome perfectly is a property of the data, not of the link. So the results that you obtain with logit should be transferable to probit.

(2) Stata typically removes variables when there is insufficient variability. In the perfect-prediction case, the outcome does not vary within a category, so Stata drops that indicator together with the observations it covers. If you want to remove these by hand, you will need to set up each of the indicators that you DO want to include, either manually or with a loop structure. For instance, let's say you want to keep 1984, 1985, and 1987, but drop 1986. One way of doing this is to create each dummy variable by hand and add it to the regression manually. Another way is to loop over the years that you do want to use.
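As a rough illustration of the flag-based route the question asks about, the sketch below marks every year in which the outcome never varies and then fits the probit without those years. The names y, x1, x2, year, y_min, y_max and perfect are all hypothetical, and the code assumes y is coded 0/1. It only checks year, so any other categorical predictor would need the same treatment.

```stata
* Hypothetical variables: y (0/1 outcome), x1 x2 (other regressors), year.
* Smallest and largest outcome value observed within each year.
bysort year: egen byte y_min = min(y)
bysort year: egen byte y_max = max(y)

* If they coincide, y never varies in that year, so the year
* indicator predicts the outcome perfectly.
generate byte perfect = (y_min == y_max)

* Look at which years are flagged before excluding anything.
tabulate year perfect

* Fit the probit only on years where the outcome varies.
probit y x1 x2 i.year if !perfect
```

Because Stata itself drops both the indicator and its observations when it detects perfect prediction, fitting on the reduced sample should give the same coefficients for the remaining terms, although that is worth confirming on your own data.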

(3) Finally, the manual changes should not affect the inverse Mills ratio. The best 'quick and dirty' way to check this is to run a simple nested model both ways, (a) removing the variables beforehand and (b) letting Stata remove them, and confirm that the two give the same results.
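The check in (3) could look something like the sketch below, run on a deliberately small specification so that version (a) finishes quickly. All variable names are hypothetical, y is assumed to be 0/1 with no missing values, and the IMR is computed as normalden(xb)/normal(xb), which is an assumption about the quantity you need rather than something taken from your setup.

```stata
* Hypothetical variables: y (0/1, never missing), x1, year.
* Flag years where y does not vary: after sorting on y within year,
* the first and last values coincide only if y is constant.
bysort year (y): generate byte skip = (y[1] == y[_N])

* (a) Let Stata detect and drop the perfect predictors itself.
probit y x1 i.year
estimates store auto
predict double xb_auto if e(sample), xb
generate double imr_auto = normalden(xb_auto) / normal(xb_auto)

* (b) Exclude the flagged years by hand.
probit y x1 i.year if !skip
estimates store manual
predict double xb_manual if e(sample), xb
generate double imr_manual = normalden(xb_manual) / normal(xb_manual)

* Coefficients, standard errors and sample sizes side by side.
estimates table auto manual, b(%9.4f) se stats(N)

* Largest absolute gap between the two versions of the IMR.
generate double imr_gap = abs(imr_auto - imr_manual)
summarize imr_gap
```

The predictions here are restricted to e(sample) on purpose. As far as I know, predict on all observations would treat the omitted year indicators as having a coefficient of zero, so any IMR it produces for the excluded observations should be interpreted with caution.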

Good luck!