I have two data files to start: one is a cohort of exposed individuals (100,000 rows) and the other is a general population cohort spanning a 5-year period (~3 million rows). I am trying to create a matching function that, for each individual in my exposed cohort, randomly selects 5 individuals from the general population cohort, matched on age and sex. Selection is WITH replacement across exposed individuals, i.e. the same general-population person may serve as a match for more than one exposed person. These randomly selected individuals will then populate a third data.table. I was previously trying to use the MatchIt package, but it is not able to do exactly what I need, which is why I am trying to create this code from scratch.
The code I have written works, but because it is a for loop and my sample sizes are very large, it takes far too long and is not a practical solution. If you have any ideas of how to speed this up, please help!
The code I have right now is the following:
find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()
  # iterate over each exposed row to find matches
  for (i in 1:nrow(exposed.cohort)) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[xxxxxxxx here is a long logical statement of which conditions the potential matches need to meet to be selected xxxxxx]
    # randomly sample 5 without replacement
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5), ]
    } else {
      matched_data <- potential_matches
    }
    # add identifier
    matched_data[, matchID := exposed_person$ID]
    # store results
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}
The above code works, but I need to write it in a way that speeds up the process (if possible). In addition, is there a way to see the matched.data output while the function is running, so that if R is interrupted/crashes/I have to stop it prematurely, my progress so far is still saved in the global environment and I don't have to completely restart?
Any help would be very much appreciated; I'm on a tight deadline and I'm starting to freak out a little bit! Thank you.
EDIT: Okay, I oversimplified quite a bit, so let me give a little more detail about the matching requirements:
My matching is not technically only on age and sex. I am also anti-exact matching on the ID variable (to make sure I don't sample someone as their own match), because my data cover a 5-year time period, and people in my exposed cohort (the exposure period is quite short) can technically be considered unexposed at other time points, so these same people are also in my general population cohort. Not only that, but there are longitudinal periods during which individuals can no longer be selected as a match, for example because they have died or became exposed themselves. I have date variables in my datasets that indicate the time periods during which someone is unavailable to be selected as a match.
Here is what my code actually looks like (I have only shown 3 potential time periods of "unavailability" here, but there are actually 7).
find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()
  # iterate over each exposed row to find matches
  for (i in 1:nrow(exposed.cohort)) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[
      birthyear == exposed_person$birthyear &
        birthmonth == exposed_person$birthmonth &
        IDVariable != exposed_person$ID &
        (!(exposed_person$exposuredate >= unavailable_start1 &
             exposed_person$exposuredate <= unavailable_end1) |
           (is.na(unavailable_start1) & is.na(unavailable_end1))) &
        (!(exposed_person$exposuredate >= unavailable_start2 &
             exposed_person$exposuredate <= unavailable_end2) |
           (is.na(unavailable_start2) & is.na(unavailable_end2))) &
        (!(exposed_person$exposuredate >= unavailable_start3 &
             exposed_person$exposuredate <= unavailable_end3) |
           (is.na(unavailable_start3) & is.na(unavailable_end3)))]
    # randomly sample 5 without replacement
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5), ]
    } else {
      matched_data <- potential_matches
    }
    # add identifier
    matched_data[, matchID := exposed_person$ID]
    # store results
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}
Here's a faster approach (roughly 500,000x faster, if your loop was headed for 6 days) that takes about 1-2 seconds on my machine for a problem of your size. In general, any time you want to make a match between two tables, it's worth considering whether you can do it with a join: joins are highly optimized for speed, and often simpler to implement than custom methods.
Here's some fake data to test on:
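Something like this, with generic ID, age, and sex columns standing in for your real matching variables:

library(data.table)
set.seed(42)

n_exposed <- 1e5
n_genpop  <- 3e6

# fake exposed cohort: 100,000 people
exposed <- data.table(ID  = seq_len(n_exposed),
                      age = sample(18:80, n_exposed, replace = TRUE),
                      sex = sample(c("M", "F"), n_exposed, replace = TRUE))

# fake general population cohort: ~3 million people
genpop <- data.table(ID  = seq_len(n_genpop),
                     age = sample(18:80, n_genpop, replace = TRUE),
                     sex = sample(c("M", "F"), n_genpop, replace = TRUE))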
Here, I make 5 copies of the exposed data (one for each of the 5 matches we want), then assign a random number from 0-1 to every row. Then I join to the general population data, where I've added an equivalent random number, matching on age, sex, and the nearest random number at or below ours (a rolling join). Matching on a random number gives us a deterministic (given the random number seed) way to get a match for each record that is pseudorandomly drawn, with replacement, from the available matching records.
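A sketch of that join with data.table's rolling-join feature, using the fake data above (roll = Inf rolls the previous observation forward, i.e. takes the nearest rand at or below ours):

exposed_5x <- exposed[rep(1:.N, each = 5)]   # 5 copies of each exposed row
exposed_5x[, rand := runif(.N)]              # one random number per copy

genpop[, rand := runif(.N)]                  # equivalent random number
setkey(genpop, age, sex, rand)               # key for a fast rolling join

# for each exposed copy, pick the general-population row with the same age
# and sex whose rand is the nearest one at or below the exposed copy's rand
matched <- genpop[exposed_5x, on = .(age, sex, rand), roll = Inf]
setnames(matched, "i.ID", "matchID")         # i.ID is the exposed person's ID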
The result (the join itself takes <1 sec on my machine) has one sampled general-population row per exposed copy, tagged with the exposed person's ID.
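A couple of quick sanity checks on the hypothetical matched table from the sketch above:

head(matched)                 # ID = sampled general-population person
matched[, .N, by = matchID]   # expect 5 rows per exposed person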
This approach has a flaw, in that it's possible the random number generated in the first table will be smaller than any matching value with the same age/sex in the second table, so the rolling join finds nothing. For instance, with the settings above, I get about 30 such non-matches, which show up as rows where the sampled ID is NA.
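One way to pull those out, continuing with the same objects:

# exposed copies the rolling join could not match (sampled ID came back NA)
unmatched <- matched[is.na(ID)]
nrow(unmatched)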
A quick bandaid for this is to go back and match those entries using a join rolling in the opposite direction. Combining them with the matched data from the first round gives us a full solution, now in about 2 seconds:
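A sketch of that second pass, reusing the hypothetical objects above:

# re-match the leftovers rolling the other way:
# roll = -Inf takes the nearest rand at or ABOVE ours instead
leftover <- matched[is.na(ID), .(matchID, age, sex, rand)]
rematch  <- genpop[leftover, on = .(age, sex, rand), roll = -Inf]

# combine both rounds into the full matched set
full_match <- rbind(matched[!is.na(ID)], rematch, use.names = TRUE)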
(I'd be eager to learn if there is a more elegant fix for that.)