Apply criteria from one table against each row in another table to generate a list of matches

32 Views Asked by At

I have 2 tables of data. 1 table has a list of start and end date ranges. The other table a list of students with their class and start dates. For each of the date ranges, I'm trying to list the students who have a start date in that range.

Range Data

Start End
10/1/2023 12/31/2023
7/1/2023 9/30/2023
4/1/2023 6/30/2023
1/1/2023 3/31/2023

Student

Learner Course Start
1 English 11/5/23
1 Math 7/10/23
2 English 7/25/23
2 Math 5/15/23
3 Science 4/25/23
3 Math 7/25/23
3 English 11/15/23
4 Science 11/5/23
4 Math 1/5/23

If I only have a single date range to filter against, I can do this using filter() to filter by start dates, but I have a dozen different ranges and don't want to run in manually for each range. My intended outcome is a list of ranges with the students having a class in that range

Start End Learner
10/1/2023 12/31/2023 1
10/1/2023 12/31/2023 3
10/1/2023 12/31/2023 4
7/1/2023 9/30/2023 1
7/1/2023 9/30/2023 2
7/1/2023 9/30/2023 3
4/1/2023 6/30/2023 2
4/1/2023 6/30/2023 3
1/1/2023 3/31/2023 4
3

There are 3 best solutions below

0
Onyambu On
mutate(df1, across(everything(), mdy)) %>%
   right_join(mutate(df2, Start = mdy(Start)), 
             by = join_by(Start <= Start , End>Start), 
             suffix = c('', '_y'))%>%
   select(-Start_y, -Course)

       Start        End Learner
1 2023-10-01 2023-12-31       1
2 2023-10-01 2023-12-31       3
3 2023-10-01 2023-12-31       4
4 2023-07-01 2023-09-30       1
5 2023-07-01 2023-09-30       2
6 2023-07-01 2023-09-30       3
7 2023-04-01 2023-06-30       2
8 2023-04-01 2023-06-30       3
9 2023-01-01 2023-03-31       4
0
Andre Wildberg On

A base R approach. The tricky part here is not finding the matches but getting the desired format of repeated dates in separate rows.

data.frame(do.call(rbind, 
  apply(ranges, 1, \(x) 
    t(sapply(which(
      as.Date(student$Start, "%m/%d/%y") >= as.Date(x["Start"], "%m/%d/%Y") & 
      as.Date(student$Start, "%m/%d/%y") <= as.Date(x["End"], "%m/%d/%Y")), \(y)
        c(x, Learner=student$Learner[y]))))))
      Start        End Learner
1 10/1/2023 12/31/2023       1
2 10/1/2023 12/31/2023       3
3 10/1/2023 12/31/2023       4
4  7/1/2023  9/30/2023       1
5  7/1/2023  9/30/2023       2
6  7/1/2023  9/30/2023       3
7  4/1/2023  6/30/2023       2
8  4/1/2023  6/30/2023       3
9  1/1/2023  3/31/2023       4

Note: Be careful with years in dates. 2-digits years need %y, 4-digit years %Y.

Data

ranges <- structure(list(Start = c("10/1/2023", "7/1/2023", "4/1/2023", 
"1/1/2023"), End = c("12/31/2023", "9/30/2023", "6/30/2023", 
"3/31/2023")), class = "data.frame", row.names = c(NA, -4L))

student <- structure(list(Learner = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L
), Course = c("English", "Math", "English", "Math", "Science", 
"Math", "English", "Science", "Math"), Start = c("11/5/23", "7/10/23", 
"7/25/23", "5/15/23", "4/25/23", "7/25/23", "11/15/23", "11/5/23", 
"1/5/23")), class = "data.frame", row.names = c(NA, -9L))
0
jay.sf On

Write a function f that compares a range with a student's start date and loop over all combinations. In such double-for-loop situations we may use outer which gives a boolean matrix where we want max.col. Finally merge.

First however, care for your strange date formats and rename student's Start to Start_stud to have unique column names.

> ranges[] <- lapply(ranges, as.Date, '%m/%d/%Y')
> student <- transform(student, Start_stud=as.Date(student$Start, '%m/%d/%y'), Start=NULL)
> f <- \(x, y) student$Start[x] >= ranges[y, 1] & student$Start[x] <= ranges[y, 2]
> student$id <- max.col(outer(seq_len(nrow(student)), seq_len(nrow(ranges)), f))
> merge(ranges, student, by.x=0, by.y='id') |> {\(.) .[order(.[, 2]), -1]}()
       Start        End Learner  Course Start_stud
9 2023-01-01 2023-03-31       4    Math 2023-01-05
7 2023-04-01 2023-06-30       3 Science 2023-04-25
8 2023-04-01 2023-06-30       2    Math 2023-05-15
4 2023-07-01 2023-09-30       1    Math 2023-07-10
5 2023-07-01 2023-09-30       3    Math 2023-07-25
6 2023-07-01 2023-09-30       2 English 2023-07-25
1 2023-10-01 2023-12-31       1 English 2023-11-05
2 2023-10-01 2023-12-31       3 English 2023-11-15
3 2023-10-01 2023-12-31       4 Science 2023-11-05

If you really want just the first three columns, you can pipe in another |> subset(select=1:3).


Data:

> dput(ranges)
structure(list(Start = c("10/1/2023", "7/1/2023", "4/1/2023", 
"1/1/2023"), End = c("12/31/2023", "9/30/2023", "6/30/2023", 
"3/31/2023")), class = "data.frame", row.names = c(NA, -4L))
> dput(student)
structure(list(Learner = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L
), Course = c("English", "Math", "English", "Math", "Science", 
"Math", "English", "Science", "Math"), Start = c("11/5/23", "7/10/23", 
"7/25/23", "5/15/23", "4/25/23", "7/25/23", "11/15/23", "11/5/23", 
"1/5/23")), class = "data.frame", row.names = c(NA, -9L))