Using combn to make specific functions for grouped pair-wise, row-wise comparisons


This is a small section of a dataset I'm working on.

dat2 <- read.table(text = "
   nodepair  V1  V2  V3  V4  V5  V6  V7  V8  V9 ES   
 1 A1_A1        0    21     0     0     0     0     0     0    78 45   
 2 A2_A1        0     0     0     0     0     0     0     0    99 45   
 3 A2_A2        0     1     0     0     0     0     0     0    98 45   
 4 A3_A1        0     0     0     0     0     6     1     3    89 45   
 5 A3_A2        0     0     0     0     0     0     0     0    99 45   
 6 A1_A1        0    20     0     0     0     0     0     0    65 46   
 7 A2_A1        0     0     0     0     0     0     0     0    85 46   
 8 A2_A2        0     1     0     0     0     0     0     0    84 46   
 9 A3_A1        0     0     0     0     2     6     3     3    71 46   
 10 A3_A2        0     0     0     0     0     0     0     0    85 46   
 11 A1_A1        0    25     0     0     0     0     0     0    45 47   
 12 A2_A1        0     0     0     0     0     0     0     0    70 47   
 13 A2_A2        0     1     0     0     0     0     0     0    69 47   
 14 A3_A1        0     0     0     0     0     8     0     1    61 47   
 15 A3_A2        0     0     0     0     0     0     0     0    70 47   
 16 A1_A1        0    37     0     0     0     0     0     0    77 48   
 17 A2_A1        0     0     0     0     0     0     0     0   114 48   
 18 A2_A2        0     0     0     0     0     0     0     0   114 48   
 19 A3_A1        0     0     0     0     2     9     0     3   100 48   
 20 A3_A2        0     0     0     0     0     0     0     0   114 48   
 ", header = TRUE)

I'm trying to write a program that will do all pairwise comparisons (grouped by the nodepair) across the 'ES' groups.

I'd like to write a series of functions to compare each pair of rows. For example, when a column in V1:V9 is > 0 for both ES values, the result should be 1, indicating that data is present in both.
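In other words, I think the rule per pair of rows (from the same nodepair) boils down to something like this, shown here for rows 1 and 6, i.e. A1_A1 at ES 45 vs. ES 46:

 # 1 where a V column is > 0 in both rows, 0 otherwise
 as.integer(dat2[1, 2:10] > 0 & dat2[6, 2:10] > 0)
 # 0 1 0 0 0 0 0 0 1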

I'm imagining the output to look something like this:

 dat3 <- read.table(text = "
    nodepair1 nodepair2  V1  V2  V3  V4  V5  V6  V7  V8  V9    
    A1_A1(45) A1_A1(46)   0     1     0     0     0     0     0     0     1        
  ", header = TRUE)

etc.

Unfortunately, I haven't gotten very far:

 dat2 <- dat2 %>%
   group_by(nodepair) %>%
   col2 = t(combn(nodepair,2)))

I'm pretty sure I need combn here, but I'm very new to this function and can't figure out how to combine it with the grouped, row-wise comparison.
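For what it's worth, I can see that combn itself enumerates the pairs I'm after (a quick check; the exact call is just my guess):

 # all 6 pairs of ES values, one pair per row of a 6 x 2 matrix
 t(combn(unique(dat2$ES), 2))

but I can't work out how to apply a comparison function to each of those pairs within each nodepair group.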


There are 2 answers below.

deschen (Best Answer)

Now that the OP has clarified their question, I'd propose the following solution:

library(tidyverse)

ES_combs <- combn(unique(dat2$ES), 2, simplify = FALSE)

dat2 |> 
  group_split(nodepair) |> 
  map(.x = _,
      .f = \(df) map(.x = seq_along(ES_combs),
                     .f = ~df |> 
                       filter(ES %in% ES_combs[[.x]]) |> 
                       summarize(nodepair = first(nodepair),
                                 ES_1 = ES[1],
                                 ES_2 = ES[2],
                                 across(V1:V9, ~as.numeric(all(. > 0)))))) |> 
  bind_rows()

which gives:

# A tibble: 30 × 12
   nodepair  ES_1  ES_2    V1    V2    V3    V4    V5    V6    V7    V8    V9
   <chr>    <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 A1_A1       45    46     0     1     0     0     0     0     0     0     1
 2 A1_A1       45    47     0     1     0     0     0     0     0     0     1
 3 A1_A1       45    48     0     1     0     0     0     0     0     0     1
 4 A1_A1       46    47     0     1     0     0     0     0     0     0     1
 5 A1_A1       46    48     0     1     0     0     0     0     0     0     1
 6 A1_A1       47    48     0     1     0     0     0     0     0     0     1
 7 A2_A1       45    46     0     0     0     0     0     0     0     0     1
 8 A2_A1       45    47     0     0     0     0     0     0     0     0     1
 9 A2_A1       45    48     0     0     0     0     0     0     0     0     1
10 A2_A1       46    47     0     0     0     0     0     0     0     0     1
11 A2_A1       46    48     0     0     0     0     0     0     0     0     1
12 A2_A1       47    48     0     0     0     0     0     0     0     0     1
13 A2_A2       45    46     0     1     0     0     0     0     0     0     1
14 A2_A2       45    47     0     1     0     0     0     0     0     0     1
15 A2_A2       45    48     0     0     0     0     0     0     0     0     1
16 A2_A2       46    47     0     1     0     0     0     0     0     0     1
17 A2_A2       46    48     0     0     0     0     0     0     0     0     1
18 A2_A2       47    48     0     0     0     0     0     0     0     0     1
19 A3_A1       45    46     0     0     0     0     0     1     1     1     1
20 A3_A1       45    47     0     0     0     0     0     1     0     1     1
21 A3_A1       45    48     0     0     0     0     0     1     0     1     1
22 A3_A1       46    47     0     0     0     0     0     1     0     1     1
23 A3_A1       46    48     0     0     0     0     1     1     0     1     1
24 A3_A1       47    48     0     0     0     0     0     1     0     1     1
25 A3_A2       45    46     0     0     0     0     0     0     0     0     1
26 A3_A2       45    47     0     0     0     0     0     0     0     0     1
27 A3_A2       45    48     0     0     0     0     0     0     0     0     1
28 A3_A2       46    47     0     0     0     0     0     0     0     0     1
29 A3_A2       46    48     0     0     0     0     0     0     0     0     1
30 A3_A2       47    48     0     0     0     0     0     0     0     0     1

This probably needs a bit of explanation:

  • We start by creating all pairwise combinations of the ES values in your data frame and assign them to a list object, ES_combs (see the short illustration after this list).
  • We then take your data and split it by nodepair into a list, where each list element is the data for one nodepair group.
  • We then start the outer map, which goes through each group's data frame. It is important to define an anonymous function here, because there is also an inner map, so we can't rely on the .x placeholder twice.
  • The inner map takes each combination pair from ES_combs, filters the current group's data down to those two rows, and then applies the summarize step.
  • As a last step, we use bind_rows to merge everything into a single tibble instead of an annoyingly long nested list.
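For illustration, here is what the intermediate objects look like for the sample data (a small sketch):

# 6 ES pairs times 5 nodepair groups = the 30 rows in the result
length(ES_combs)
# [1] 6
ES_combs[[1]]
# [1] 45 46
dat2 |> group_split(nodepair) |> length()
# [1] 5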
Mark

Here is how I would do it:

library(tidyverse)

# create a function which compares one pair of rows, given as a vector of two row indices
comparer <- function(comb) {
  bind_cols(
    nodepair1 = paste0(dat2[comb[1], 1], "(", dat2[comb[1], 11], ")"),
    nodepair2 = paste0(dat2[comb[2], 1], "(", dat2[comb[2], 11], ")"),
    dat2[comb[1], 2:10] > 0 & dat2[comb[2], 2:10] > 0
  )
}
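
To see what a single call does, we can try it on one pair of row indices; rows 1 and 6 are A1_A1 at ES 45 and ES 46 (just an illustrative check, not needed for the final result):

# a one-row tibble: nodepair1, nodepair2, then V1:V9, with V2 and V9 TRUE
comparer(c(1, 6))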

combs <- combn(1:nrow(dat2), 2, simplify = FALSE)

# then apply that to each combination of rows for the dataset
map_df(combs, comparer)

# Output:
# A tibble: 190 × 11
   nodepair1 nodepair2 V1    V2    V3    V4    V5    V6    V7    V8    V9   
   <chr>     <chr>     <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 A1_A1(45) A2_A1(45) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 2 A1_A1(45) A2_A2(45) FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 3 A1_A1(45) A3_A1(45) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 4 A1_A1(45) A3_A2(45) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 5 A1_A1(45) A1_A1(46) FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 6 A1_A1(45) A2_A1(46) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 7 A1_A1(45) A2_A2(46) FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 8 A1_A1(45) A3_A1(46) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
 9 A1_A1(45) A3_A2(46) FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
10 A1_A1(45) A1_A1(47) FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE TRUE 
# ℹ 180 more rows

Update: If you only want to compare rows with the same 'nodepair' value, you can do this:

processor <- function(x) {
  map_df(combn(x, 2, simplify = FALSE), comparer)
}

dat2 |>
  mutate(n = row_number()) |>
  summarise(comparisons = map(list(n), processor), .by = nodepair) |>
  unnest(comparisons)

# Output:
# A tibble: 30 × 12
   nodepair nodepair1 nodepair2 V1    V2    V3    V4    V5   
   <chr>    <chr>     <chr>     <lgl> <lgl> <lgl> <lgl> <lgl>
 1 A1_A1    A1_A1(45) A1_A1(46) FALSE TRUE  FALSE FALSE FALSE
 2 A1_A1    A1_A1(45) A1_A1(47) FALSE TRUE  FALSE FALSE FALSE
 3 A1_A1    A1_A1(45) A1_A1(48) FALSE TRUE  FALSE FALSE FALSE
 4 A1_A1    A1_A1(46) A1_A1(47) FALSE TRUE  FALSE FALSE FALSE
 5 A1_A1    A1_A1(46) A1_A1(48) FALSE TRUE  FALSE FALSE FALSE
 6 A1_A1    A1_A1(47) A1_A1(48) FALSE TRUE  FALSE FALSE FALSE
 7 A2_A1    A2_A1(45) A2_A1(46) FALSE FALSE FALSE FALSE FALSE
 8 A2_A1    A2_A1(45) A2_A1(47) FALSE FALSE FALSE FALSE FALSE
 9 A2_A1    A2_A1(45) A2_A1(48) FALSE FALSE FALSE FALSE FALSE
10 A2_A1    A2_A1(46) A2_A1(47) FALSE FALSE FALSE FALSE FALSE
# ℹ 20 more rows

Note: all of these results use TRUE and FALSE. In R, TRUE and FALSE are equivalent to 1 and 0; you can see this by evaluating TRUE + TRUE, which equals 2. I could convert dat2[comb[1], 2:10] > 0 & dat2[comb[2], 2:10] > 0 to integers inside comparer, but I figured you would understand either way; one way to do the conversion is sketched below.
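
If you'd rather have 0/1 columns in the output, one simple way is to convert them afterwards (a minimal sketch, shown for the first, ungrouped result):

# turn the logical V1:V9 columns into integers (TRUE -> 1, FALSE -> 0)
map_df(combs, comparer) |>
  mutate(across(V1:V9, as.integer))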