ID pairing and unique pair count

I am writing R code that should analyze two columns, P1 and P2, which both contain ID codes, together with the respective PAIR column.

  1. I want each individual ID-code to be only used once for a pair, but the individual ID-code can be within P1 and P2 (just in different rows).

  2. Further, I want to exclude logical duplicates: a pair such as "X30112_X30101" is a duplicate of "X30101_X30112".

  3. In the long run, I am actually looking for the maximum number of pairs. This is tricky because each ID code can be used only once, but the data shows that one ID code can be paired with several others (1:n).

Unfortunately, I lack the experience to describe this better, and I think it might be a combinatorial problem. I would be happy for any kind of help.

What I tried so far?

So far I have only managed to solve 1), using a simpler data frame, with this code, which more or less worked:

    # Sample data: df dataframe
    df <- data.frame(
      P1 = c("A", "B", "C", "W"),
      P2 = c("W", "X", "Y", "A"),
      PAIR = c("A_W", "B_X", "C_Y", "W_A")
    )

    # Function to normalize and sort pairs
    normalize_and_sort <- function(pair) {
      elements <- unlist(strsplit(pair, "[_\\.]"))
      sorted_pair <- paste(sort(elements), collapse = "_")
      return(sorted_pair)
    }

    # Normalize and sort the pairs and keep unique pairs
    unique_pairs_df <- data.frame(PAIR = unique(sapply(df$PAIR, normalize_and_sort)))

    # Print the unique_pairs_df
    print(unique_pairs_df)
    #>   PAIR
    #> 1  A_W
    #> 2  B_X
    #> 3  C_Y

But this did not work with my actual dataframe. Maybe because my ID-codes use numbers, too.

There are 1 best solutions below

Gregor Thomas On

Your idea to sort the pairs is just right. With just two columns, this is easy using the vectorized pmin and pmax:

    df$sorted_pair = with(df, paste(pmin(P1, P2), pmax(P1, P2), sep = "_"))

Then you can use any standard code to remove duplicates, like

    df[!duplicated(df$sorted_pair), ]
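
As a rough end-to-end sketch, the two steps can be run on the question's sample data and followed by a simple greedy pass so that each ID code is used at most once. The extra row ("A", "X") and the names `dedup`, `used`, `keep`, and `result` are hypothetical and added only for illustration. Note the assumption: a greedy pass yields a valid set of pairs, but not necessarily the largest possible one. Finding the true maximum in point 3 is a maximum matching problem from graph theory and would need a dedicated algorithm.

```r
# Sample data from the question, plus one hypothetical extra row ("A", "X")
# so that ID "A" appears in more than one candidate pair.
df <- data.frame(
  P1 = c("A", "B", "C", "W", "A"),
  P2 = c("W", "X", "Y", "A", "X"),
  stringsAsFactors = FALSE  # pmin/pmax need character columns, not factors
)

# Order-independent key, then drop logical duplicates such as W_A vs. A_W
df$sorted_pair <- with(df, paste(pmin(P1, P2), pmax(P1, P2), sep = "_"))
dedup <- df[!duplicated(df$sorted_pair), ]

# Greedy pass (a heuristic, not a guaranteed maximum):
# keep a pair only if neither of its IDs has been used by an earlier pair
used <- character(0)
keep <- logical(nrow(dedup))
for (i in seq_len(nrow(dedup))) {
  ids <- c(dedup$P1[i], dedup$P2[i])
  if (!any(ids %in% used)) {
    keep[i] <- TRUE
    used <- c(used, ids)
  }
}
result <- dedup[keep, ]
print(result$sorted_pair)
```

Because the greedy result depends on row order, reordering `dedup` (for example, processing IDs with the fewest candidate partners first) may give a different and sometimes larger set of pairs.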