I would like to compare two mixed-type data frames and return the rows that are different between them--but I would like numeric values to only be returned within a certain percentage.
tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
var3 = c(1, 2, 3, 4, 5),
var4 = c('yes', 'no', 'yes', 'yes', 'no'))
tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
var3 = c(2, 3, 1.5, 10, 4.1),
var4 = c('no', 'yes', 'yes', 'no', 'yes'))
I know there is dplyr::anti_join but that checks for exact matches. So if I was OK with numeric values that were within 20%, then the function would be something like:
tbl1 %>%
antijoin_function(tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'),
pct = 0.2)
And return
| var1 | var2 | var3 | var4 |
|---|---|---|---|
| r1 | apple | 1 | yes |
| r5 | lime | 5 | no |
The row with strawberry would not be returned because the single difference in var3 is less than 20%.
Are there any functions or packages that do this?
Created on 2023-05-22 with reprex v2.0.2