I have one dataframe with a grouping ID ('Site location') where different events have happened between a 'start date_time' and an 'end date_time'. I want to figure out how I can remove rows where an event's start date_time is earlier than or equal to the end date_time of an earlier event that happened at that site location. Or in other words, how can I remove an event that already occurs within higher up in the dataframe (with the same 'Site location').
Some example data:
df <- data.frame("Site location" = rep(c("A","B"),each = 5),
"Start date_time" = as.POSIXct(c("2022-01-01 00:00","2022-01-01 00:45","2022-01-01 01:00","2022-01-01 03:00","2022-01-01 04:00",
"2022-01-01 00:00","2022-01-01 01:00","2022-01-01 02:00","2022-01-01 03:00","2022-01-01 04:00")),
"End date_time" = as.POSIXct(c("2022-01-01 01:00","2022-01-01 01:15","2022-01-01 03:00","2022-01-01 03:15","2022-01-01 05:00",
"2022-01-01 00:45","2022-01-01 01:45","2022-01-01 02:45","2022-01-01 03:45","2022-01-01 04:45")))
So in this example, I would like to remove rows 2 and 3 because their start date/time is earlier or equal to the end date/time of row 1. The removal of row 3 means that row 4 should be kept in (if row 3 wasn't removed, I'd want this filtered out as the start date of row 4 == the end date of row 3). All other rows can be kept as there are no other cross overs.
Many thanks in advance.
The following solution:
(1) Ensure data is sorted by site start times
(2) Use
shift()to look at the preceding row's end time (within site groups) and compare preceding end time with current row's start/end times, assigning a dummy variable if time overlap is true(3) Find the
Positionof the first TRUE overlap and add another dummy variable "is_simultaneous" if that first index is equal to the current row's index (within site groups)(4) Repeatedly remove any first index instances where observations are "is_simultaneous" until none are left. Here, a loop is used since OP requests that subsubsequent rows are removed only conditional on preceding rows having already been removed (and sample shows row 4's start time is equal to row 3's end time, which should be considered an overlap unless row 3 has already been removed)
Clean up the dummy variables
The result shows that the site_rank column created initially has rows 2 and 3 removed (and row 4 retained).