Filtering out rows if a start date lies between a start and end date in another row

47 Views Asked by At

I have one dataframe with a grouping ID ('Site location') where different events have happened between a 'start date_time' and an 'end date_time'. I want to figure out how I can remove rows where an event's start date_time is earlier than or equal to the end date_time of an earlier event that happened at that site location. Or in other words, how can I remove an event that already occurs within higher up in the dataframe (with the same 'Site location').

Some example data:

df <- data.frame("Site location" = rep(c("A","B"),each = 5),
             "Start date_time" = as.POSIXct(c("2022-01-01 00:00","2022-01-01 00:45","2022-01-01 01:00","2022-01-01 03:00","2022-01-01 04:00",
                                         "2022-01-01 00:00","2022-01-01 01:00","2022-01-01 02:00","2022-01-01 03:00","2022-01-01 04:00")),
             "End date_time" = as.POSIXct(c("2022-01-01 01:00","2022-01-01 01:15","2022-01-01 03:00","2022-01-01 03:15","2022-01-01 05:00",
                                         "2022-01-01 00:45","2022-01-01 01:45","2022-01-01 02:45","2022-01-01 03:45","2022-01-01 04:45")))

So in this example, I would like to remove rows 2 and 3 because their start date/time is earlier or equal to the end date/time of row 1. The removal of row 3 means that row 4 should be kept in (if row 3 wasn't removed, I'd want this filtered out as the start date of row 4 == the end date of row 3). All other rows can be kept as there are no other cross overs.

Many thanks in advance.

2

There are 2 best solutions below

0
Soren On

The following solution:

(1) Ensure data is sorted by site start times

(2) Use shift() to look at the preceding row's end time (within site groups) and compare preceding end time with current row's start/end times, assigning a dummy variable if time overlap is true

(3) Find the Position of the first TRUE overlap and add another dummy variable "is_simultaneous" if that first index is equal to the current row's index (within site groups)

(4) Repeatedly remove any first index instances where observations are "is_simultaneous" until none are left. Here, a loop is used since OP requests that subsubsequent rows are removed only conditional on preceding rows having already been removed (and sample shows row 4's start time is equal to row 3's end time, which should be considered an overlap unless row 3 has already been removed)

library(data.table)
setDT(df)
setorder(df,
         Site.location,
         Start.date_time,
         End.date_time)

df[,site_rank:=.I,
   by=.(Site.location)]

repeat {
  df[,
     time_overlaps:= shift(End.date_time,
                           n=1,
                           fill=NA,
                           type="lag") %between% list(Start.date_time,End.date_time),
     by=.(Site.location)]
  
  df[,
     is_simultaneous:=Position(f=isTRUE,
                               time_overlaps,
                               nomatch=0)==.I,
     by=.(Site.location)]

  if (!any(df$is_simultaneous)) {
    break;
  } else { 
    df <- df[is_simultaneous==FALSE]
  }
}

Clean up the dummy variables

 df[,
       `:=`(time_overlaps=NULL,
            is_simultaneous=NULL)]

The result shows that the site_rank column created initially has rows 2 and 3 removed (and row 4 retained).

1
Wimpel On

I believe this data.table approach shpuld to the trick

library(data.table)
# set to data.table format
setDT(df)
# add a rowid
df[, id := .I]
# see if overlap occurs (perform join to identify overlapping rows)
to.be.removed <- df[df, on = .(Site.location, Start.date_time < End.date_time)][id > i.id, id]
# final
df[!id %in% to.be.removed, ]

   Site.location     Start.date_time       End.date_time    id
          <char>              <POSc>              <POSc> <int>
1:             A 2022-01-01 00:00:00 2022-01-01 01:00:00     1
2:             A 2022-01-01 03:00:00 2022-01-01 03:15:00     4
3:             A 2022-01-01 04:00:00 2022-01-01 05:00:00     5
4:             B 2022-01-01 00:00:00 2022-01-01 00:45:00     6
5:             B 2022-01-01 01:00:00 2022-01-01 01:45:00     7
6:             B 2022-01-01 02:00:00 2022-01-01 02:45:00     8
7:             B 2022-01-01 03:00:00 2022-01-01 03:45:00     9
8:             B 2022-01-01 04:00:00 2022-01-01 04:45:00    10