I have a bunch of categorical variables as predictors, for which I use factor notation (`i.year`, for example). There are a few years (with many observations) that predict the outcome exactly. Stata very slowly works through these cases and drops them. In this setup, I have the following questions:
1. Can I manually (through some simple code) figure out these perfect-prediction groups and exclude those observations (through an `if` condition on a flag) while running `probit`? Is this justifiable, and will my estimates be the same? (Because Stata drops these cases so slowly, I still don't have the estimates from letting it drop them inside the `probit` execution.)
2. I need to predict the inverse Mills ratio from this `probit`, which is needed for my further analysis. If I take the manual approach above and then run `predict IMR, score` on ALL the data, will this give the correct IMR for those excluded observations that lead to perfect prediction?
I have a few comments for you.
(1) First, the assumed distribution (logit versus probit) -- sometimes called the 'link function' -- shouldn't determine the association between a categorical independent variable and your dependent variable. So, the results you obtain with logit should carry over to probit.
(2) Stata typically removes variables when there is insufficient variability in the independent variable -- here, when the outcome is constant within a year. If you want to remove these by hand, you will need to set up each of the indicators that you DO want to include, either manually or with a loop. For instance, suppose you want to keep 1984, 1985, and 1987, but drop 1986. One way is to create each dummy variable by hand and add it to the regression; another is to loop over the years you do want to use.
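As a rough sketch of both approaches (all variable names here -- `y`, `x1`, `x2`, `year` -- are hypothetical placeholders for your own), you could flag the years where the outcome actually varies and condition on that flag:

```
* A year perfectly predicts y when y is constant within that year,
* so flag observations in years where the outcome varies.
bysort year: egen ymin = min(y)
bysort year: egen ymax = max(y)
gen byte keepobs = (ymin < ymax)   // 1 = outcome varies within the year

* Run probit only on the non-degenerate years
probit y x1 x2 i.year if keepobs

* Alternatively, build only the year dummies you DO want via a loop
levelsof year if keepobs, local(goodyears)
foreach yr of local goodyears {
    gen byte yr`yr' = (year == `yr')
}
```

The `if keepobs` route keeps the convenient `i.year` factor notation; the loop route gives you explicit control over exactly which dummies enter the model.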
(3) Finally, the manual approach should not affect the inverse Mills ratio for the retained observations. However, the best 'quick and dirty' way to check this is to run a simple nested model using both methods -- (a) removing the observations yourself versus (b) letting Stata remove them -- and confirm that you get the same results across both simple models.
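A minimal sketch of that comparison, and of computing the inverse Mills ratio by hand afterwards (assuming hypothetical names `y`, `x1`, `x2`, and a flag `keepobs` marking observations in years where the outcome varies):

```
* (a) let Stata detect and drop the perfect-prediction cases itself
probit y x1 x2 i.year
estimates store auto

* (b) exclude those observations manually via the flag
probit y x1 x2 i.year if keepobs
estimates store manual

* The coefficients on the shared terms should coincide
estimates table auto manual

* Inverse Mills ratio from the linear prediction:
* the standard formula is phi(xb)/Phi(xb)
predict xbhat, xb
gen IMR = normalden(xbhat)/normal(xbhat)
```

Note that `predict, xb` computes the linear index for all observations with non-missing covariates, including the excluded ones, so you can inspect what the IMR formula returns for the perfect-prediction years -- though for those observations the probit model is degenerate, so treat those values with caution.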
Good luck!