Results from dplyr::join on UUID column differ between Linux and Mac OS

94 Views Asked by At

I am attempting to join 2 tibbles (dplyr::inner_join) based on a UUID column in each. Oddly, instead of the expected one-to-one relationship, the join produces a many-to-many relationship (as stated in a warning message and also indicated by the resulting tibble containing substantially more records than either of the original tibbles). In digging into the join results, it is apparent that the tibbles are being joined on unmatched UUIDs and that is producing the incorrect many-to-many relationship. Changing the data type of the UUID columns from UUID to character produces the correct one-to-one relationship (as indicated by the resulting tibble containing the same number of records as the two original tibbles).

I shared my code with a colleague and when he ran the code (i.e., inner joining by the UUID columns, data type for each column = UUID), it produced the expected one-to-one relationship. The only obvious difference between our respective computing environments is the OS: mine is Ubuntu 22.04.4 and his is Mac OS Ventura. We are both using RStudio 2023.12.1+402.

I'd appreciate any insights into what is causing this OS-related difference, and how to avoid the mis-joining of the UUID columns when using a Linux OS.

Here is a minimum reproducible example to demonstrate the issue:

# load packages

library(dplyr)
library(uuid)

# create 2 datasets

id1 <- UUIDgenerate(n = 100000, output = "uuid")
data1a <- tibble(id1, type_m = "aaaa")

data1b <- data1a %>% 
  rename(type_n = type_m) %>% 
  mutate(type_n = replace(type_n, type_n == "aaaa", "bbbb"))

# join on id1 column with "UUID" data type 

data1c <- data1a %>% 
  inner_join(data1b, by = join_by(id1), keep = TRUE)

# identify mis-joined records

data1c_troubleshoot <- data1c %>% filter(id1.x != id1.y)

# change data type of join_by column to "character" and join

data1a <- data1a %>% mutate(id1 = as.character(id1))
data1b <- data1b %>% mutate(id1 = as.character(id1))

data1c <- data1a %>% 
  inner_join(data1b, by = join_by(id1), keep = TRUE)

data1c_troubleshoot <- data1c %>% filter(id1.x != id1.y)
4

There are 4 best solutions below

0
Eizy On

I ran the above code on an AWS Linux/Ubuntu (18.04.1-Ubuntu) server with R version 4.0.2 (2020-06-22) and had the same issues/improper joining of UUID data type. A many to many relationship occurred.

0
Thomas C Smith On

I ran the above code on Windows 10, R version 4.3.2, and recreated the issue described in the original post. In addition, I adjusted the n in UUIDgenerate(): when n was reduced from 100,000 to 10,000, the issue persisted. Reduced to n=1,000, the same issue did not arise. Going further, it arose at n = 9,999, but not at 1,001.

0
Eizy On

I ran the code on my local device, MacOS Ventura 13.6.4 with R version 4.2.1, and had the same issues/improper joining of UUID data type. A many to many relationship occurred. If I adjust id1 <- UUIDgenerate(n = 999, output = "uuid") the join works properly.

0
mtnrambler On

I'm answering my own question based on what I've learned since the original posting. First, the observed incorrect joining based on UUID columns turned out not to be OS-related. In fact, the same issue occurs on Linux, Mac, and Windows (see comments). Second, the incorrect joining appears to be the result of dplyr not recognizing the UUID data type. As such, converting the UUID data type to character before joining solves the issue. The described issue could have been avoided if the dplyr documentation was clearer about what data types are and are not recognized.