I have my data in two data tables as below (with many more columns than just shown here) -
DataTable 1 = data_sale
| Site Id | Country | Product ID |
|---|---|---|
| 1000375476 | Canada | UG10000-WISD |
| 1000375476 | Canada | UGD12895 |
| 1000706152 | Switzerland | UG10000-WISD |
| 1000706152 | Switzerland | UG80000-NTCD-G |
| 1000797366 | Italy | UG10000-WISD |
| 1000797366 | Italy | UG12210 |
DataTable 2 = data_licenses
| Site Id | Country | Product ID |
|---|---|---|
| 1000375476 | Canada | UG10000-WISD |
| 1000375476 | Canada | UGD12895 |
| 1000797366 | Italy | UG12785 |
| 1000797366 | Italy | UG12210 |
I want to calculate the set difference for unique Product ID for all the Site Id in data_sale, keeping all rows.
Here is what I've done so far -
- For both of the data tables, I've created a new column with all unique products in it.
data_sale <-
data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","),
keyby = c("Site Id")]
data_licenses <-
data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")),
keyby = c("Site Id")]
- Left Merged data_sale with data_licenses
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)
Now the merged datatable look like this -
| Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses |
|---|---|---|---|---|
| 1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
| 1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
| 1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA |
| 1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA |
| 1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 |
| 1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 |
The problem is with my final step where I want a new column showing difference between the products of data_sale and data_licenses, it should look like this -
| Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses | difference |
|---|---|---|---|---|---|
| 1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
| 1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
| 1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
| 1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
| 1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
| 1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
Any leads on how it can be achieved will be of great help. Thanks!
Below is the data using dput() for the merged datatable
structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152",
"1000706152", "1000797366", "1000797366"), Country = c("Canada",
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD",
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD",
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895",
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G",
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895",
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)
This will get the products in
data_salethat are not indata_licensebySite Id. Instead of concatenating the unique product IDs, it's easier to work with the unique columns as character vectors.