I have two data files to start: one is a cohort of exposed individuals (100,000 rows) and the other is a general population cohort spanning a 5-year period (~3 million rows). I am trying to create a matching function that, for each individual in my exposed cohort, randomly selects 5 individuals from the general population cohort, matched on age and sex. Selection is WITH replacement across exposed individuals, i.e. the same general-population person may serve as a match for more than one exposed person. These randomly selected individuals will then populate a third data.table. I was previously trying to use the MatchIt package, but it is not able to do exactly what I need, which is why I am trying to create this code from scratch.
The code I have written works, but because it is a for loop and my sample sizes are very large, it takes far too long and is not a practical solution. If you have any ideas of how to speed this up, please help!
The code I have right now is the following:
find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()
  # iterate over each exposed row to find matches
  for (i in 1:nrow(exposed.cohort)) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[xxxxxxxx here is a long logical statement of which conditions the potential matches need to meet to be selected xxxxxx]
    # randomly sample 5 without replacement
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5), ]
    } else {
      matched_data <- potential_matches
    }
    # add identifier
    matched_data[, matchID := exposed_person$ID]
    # store results
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}
The above code works, but I need to write it in a way that speeds up the process (if possible). In addition, is there a way to see the matched.data output while the function is running, so that if R is interrupted/crashes/I have to stop it prematurely, my progress so far is still saved in the global environment and I don't have to completely restart?
Any help would be very much appreciated; I'm on a tight deadline and I'm starting to freak out a little bit! Thank you.
EDIT: Okay, I oversimplified quite a bit, so let me give a little more detail about the matching requirements:
My matching is not technically only on age and sex. I am also anti-exact matching on the ID variable (to make sure I don't sample someone as their own match), because my data cover a 5-year time period, and people in my exposed cohort (the exposure period is quite short) can technically be considered unexposed at other time points, so these same people are also in my general population cohort. Not only that, but there are longitudinal periods during which individuals can no longer be selected as a match, for example because they have died or became exposed themselves. I have date variables in my datasets that indicate the time periods during which someone is unavailable to be selected as a match.
Here is what my code actually looks like (I have only shown 3 potential time periods of "unavailability" here, but there are actually 7).
find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()
  # iterate over each exposed row to find matches
  for (i in 1:nrow(exposed.cohort)) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[
      birthyear == exposed_person$birthyear &
        birthmonth == exposed_person$birthmonth &
        IDVariable != exposed_person$ID &
        (!(exposed_person$exposuredate >= unavailable_start1 &
             exposed_person$exposuredate <= unavailable_end1) |
           (is.na(unavailable_start1) & is.na(unavailable_end1))) &
        (!(exposed_person$exposuredate >= unavailable_start2 &
             exposed_person$exposuredate <= unavailable_end2) |
           (is.na(unavailable_start2) & is.na(unavailable_end2))) &
        (!(exposed_person$exposuredate >= unavailable_start3 &
             exposed_person$exposuredate <= unavailable_end3) |
           (is.na(unavailable_start3) & is.na(unavailable_end3)))]
    # randomly sample 5 without replacement
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5), ]
    } else {
      matched_data <- potential_matches
    }
    # add identifier
    matched_data[, matchID := exposed_person$ID]
    # store results
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}
Here's a faster approach (roughly 500,000x faster, if your loop was headed for 6 days) that takes about 1-2 seconds on my machine for a problem of your size. In general, any time you want to make a match between two tables, it's worth considering whether you can do it with a join: joins are highly optimized for speed, and often simpler to implement than custom methods.
Here's some fake data to test on:
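Something like this, with generic ID, age, and sex columns standing in for your real matching variables:

library(data.table)
set.seed(42)

n_exposed <- 1e5
n_genpop  <- 3e6

# fake exposed cohort: 100,000 people
exposed <- data.table(ID  = seq_len(n_exposed),
                      age = sample(18:80, n_exposed, replace = TRUE),
                      sex = sample(c("M", "F"), n_exposed, replace = TRUE))

# fake general population cohort: ~3 million people
genpop <- data.table(ID  = seq_len(n_genpop),
                     age = sample(18:80, n_genpop, replace = TRUE),
                     sex = sample(c("M", "F"), n_genpop, replace = TRUE))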
Here, I make 5 copies of the exposed data (one for each of the 5 matches we want), then assign a random number from 0-1 to every row. Then I join to the general population data, where I've added an equivalent random number, matching on age, sex, and the nearest random number at or below ours (a rolling join). Matching on a random number gives us a deterministic (given the random number seed) way to get a match for each record that is pseudorandomly drawn, with replacement, from the available matching records.
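A sketch of that join with data.table's rolling-join feature, using the fake data above (roll = Inf rolls the previous observation forward, i.e. takes the nearest rand at or below ours):

exposed_5x <- exposed[rep(1:.N, each = 5)]   # 5 copies of each exposed row
exposed_5x[, rand := runif(.N)]              # one random number per copy

genpop[, rand := runif(.N)]                  # equivalent random number
setkey(genpop, age, sex, rand)               # key for a fast rolling join

# for each exposed copy, pick the general-population row with the same age
# and sex whose rand is the nearest one at or below the exposed copy's rand
matched <- genpop[exposed_5x, on = .(age, sex, rand), roll = Inf]
setnames(matched, "i.ID", "matchID")         # i.ID is the exposed person's ID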
The result (the join itself takes <1 sec on my machine) has one sampled general-population row per exposed copy, tagged with the exposed person's ID.
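A couple of quick sanity checks on the hypothetical matched table from the sketch above:

head(matched)                 # ID = sampled general-population person
matched[, .N, by = matchID]   # expect 5 rows per exposed person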
This approach has a flaw, in that it's possible the random number generated in the first table will be smaller than any matching value with the same age/sex in the second table, so the rolling join finds nothing. For instance, with the settings above, I get about 30 such non-matches, which show up as rows where the sampled ID is NA.
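One way to pull those out, continuing with the same objects:

# exposed copies the rolling join could not match (sampled ID came back NA)
unmatched <- matched[is.na(ID)]
nrow(unmatched)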
A quick bandaid for this is to go back and match those entries using a join rolling in the opposite direction. Combining them with the matched data from the first round gives us a full solution, now in about 2 seconds:
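A sketch of that second pass, reusing the hypothetical objects above:

# re-match the leftovers rolling the other way:
# roll = -Inf takes the nearest rand at or ABOVE ours instead
leftover <- matched[is.na(ID), .(matchID, age, sex, rand)]
rematch  <- genpop[leftover, on = .(age, sex, rand), roll = -Inf]

# combine both rounds into the full matched set
full_match <- rbind(matched[!is.na(ID)], rematch, use.names = TRUE)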
(I'd be eager to learn if there is a more elegant fix for that.)