Dropping observations if there is a gap (missing observation in specific date range)

50 Views Asked by At

I have a panel data for companies between years 2001-2019. I need to clean this dataset so that I drop all firms which have a missing data in one of the explanatory variables for regression. The problem is that I only need to drop whole firm if there is a gap in data. If the firm has data for 2003-2007 I need to keep it. Only if it has data for example from 2001 to 2003 and than 3 years missing and than again have all data from 2007 to 2019 I need to drop it. In this second case there is a gap and I need to drop all observations for that firm (even if this happens only for one variable).

You will help me a lot if you have any ideas. I tried a lot, but could not find a way how to do this.

I tried this, but does not work properly:

firms_with_gaps <-
    panel_data %>%
    group_by(company_id) %>%
    filter(!(all(year >= 2001 & year <= 2019))) %>%
    pull(company_id)

Thank you very much for your ideas!

1

There are 1 best solutions below

0
Umar On

Instead of using all, you should use any to check if there are any years missing within the specified range for each firm. updated code

library(dplyr)

panel_data_cleaned <- panel_data %>%
  group_by(company_id) %>%
  filter(all(year >= 2001 & year <= 2019) & any(!(year %in% 2001:2019)))

or

# Identify firms with gaps in any of the explanatory variables
firms_with_gaps <- panel_data %>%
  group_by(company_id) %>%
  filter(any(is.na(c(exp_var1, exp_var2, ...)))) %>%  # Include all relevant variables
  ungroup() %>%
  distinct(company_id)