Finding Overlapping Dates for Matching Brands in R

63 Views Asked by At

I got an R data frame and tried to find rows where, simultaneously, the entries of the column data_recalls$Attributed_Brand match and the time periods overlap. The time periods can be found in the columns Before_Date_Recall and After_Date_Recall. Possible matches would look like this:

(1) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-09-22; After_Date_Recall: 2018-10-21

(2) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-10-19; After_Date_Recall: 2018-10-24

Unfortunately, I can only come up with a solution where the code looks for matches in the column data_recalls$Attributed_Brand and, after that, for overlapping dates, which does not create the results needed. My data frame looks like this

data_recalls <- data.frame(Attributed_Brand = c("Nike", "Adidas", "Nike", "Puma"),
                       Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-10-19", "2018-11-01"),
                       After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-10-31", "2018-11-10"),
                       stringsAsFactors = FALSE)

Thanks for any suggestions!

2

There are 2 best solutions below

2
Davis Vaughan On

Sounds like the ivs package would work for you. The groups column allows you to identify overlapping rows within each brand name.

library(ivs)
library(dplyr, warn.conflicts = FALSE)

data_recalls <- tibble(
  Attributed_Brand = c(
    "Nike", "Nike", "Nike", 
    "Adidas", "Adidas", "Adidas", "Adidas"
  ),
  Before_Date_Recall = as.Date(c(
    "2018-09-22", "2018-10-19", "2018-10-26",
    "2018-09-01", "2018-10-05", "2018-09-03", "2018-10-02"
  )),
  After_Date_Recall = as.Date(c(
    "2018-10-21", "2018-10-24", "2018-10-30",
    "2018-09-05", "2018-10-15", "2018-09-07", "2018-10-10"
  ))
)

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(group = iv_identify_group(Date_Recall), .by = Attributed_Brand) %>%
  filter(n() > 1, .by = c(Attributed_Brand, group))
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall                    group
#>   <chr>                          <iv<date>>               <iv<date>>
#> 1 Nike             [2018-09-22, 2018-10-21) [2018-09-22, 2018-10-24)
#> 2 Nike             [2018-10-19, 2018-10-24) [2018-09-22, 2018-10-24)
#> 3 Adidas           [2018-09-01, 2018-09-05) [2018-09-01, 2018-09-07)
#> 4 Adidas           [2018-10-05, 2018-10-15) [2018-10-02, 2018-10-15)
#> 5 Adidas           [2018-09-03, 2018-09-07) [2018-09-01, 2018-09-07)
#> 6 Adidas           [2018-10-02, 2018-10-10) [2018-10-02, 2018-10-15)

And here is one other alternate way that may be faster depending on the structure of your actual dataset

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(count = iv_count_overlaps(Date_Recall, Date_Recall), .by = Attributed_Brand) %>%
  filter(count > 1)
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall count
#>   <chr>                          <iv<date>> <int>
#> 1 Nike             [2018-09-22, 2018-10-21)     2
#> 2 Nike             [2018-10-19, 2018-10-24)     2
#> 3 Adidas           [2018-09-01, 2018-09-05)     2
#> 4 Adidas           [2018-10-05, 2018-10-15)     2
#> 5 Adidas           [2018-09-03, 2018-09-07)     2
#> 6 Adidas           [2018-10-02, 2018-10-10)     2
2
r2evans On

Try this:

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and ((t1.Before_Date_Recall > t2.Before_Date_Recall and t1.Before_Date_Recall < t2.After_Date_Recall)
           or (t1.After_Date_Recall > t2.Before_Date_Recall and t1.After_Date_Recall < t2.After_Date_Recall))")
#   Attributed_Brand Before_Date_Recall After_Date_Recall
# 1             Nike         2018-09-22        2018-10-21
# 2             Nike         2018-10-19        2018-10-31

I tried first using sql's between, but that's inclusive, which causes every row to match itself. We could likely safeguard against that with some added portions, but this is a start.

In order to allow overlap, your data needs to have a unique id for each row:

data_recalls$id <- seq_len(nrow(data_recalls))

From there, we can use between and then add inequality of the id field.

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and (t1.Before_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall
           or t1.After_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall)
      and t1.id <> t2.id")
#   Attributed_Brand Before_Date_Recall After_Date_Recall id
# 1             Nike         2018-09-22        2018-10-21  1
# 2             Nike         2018-10-19        2018-10-31  3