How can I find the number of identical matrices/data frames in my list?


I have a list that contains 100 data frames. I need to find the number of identical data frames in the list.

df1 <- data.frame(A = c(1, 1, 0, 1),
                  B = c(0, 0, 1, 1),
                  C = c(1, 1, 1, 0),
                  D = c(1, NA, 0, NA))

df2 <- data.frame(A = c(1, 1, 0, 1),
                  B = c(0, 0, 1, 1),
                  C = c(1, 1, 1, 0),
                  D = c(1, NA, NA, 0))

df3 <- data.frame(A = c(1, 1, 0, 1),
                  B = c(0, 0, 1, 1),
                  C = c(1, 1, 1, 0),
                  D = c(NA, 1, NA, 0))

df4 <- data.frame(A = c(1, 1, 0, 1),
                  B = c(0, 0, 1, 1),
                  C = c(1, 1, 1, 0),
                  D = c(NA, 1, NA, 0))

list1 <- list(df1, df2, df3, df4)
list1

As you can see, df3 and df4 in the list are identical. Since the list consists of 4 objects, 6 pairwise comparisons must be made.
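
For concreteness, here is a minimal sketch of those 6 pairwise comparisons, assuming "identical" means base R's identical() in the strict sense (including NA positions):

pairs <- combn(seq_along(list1), 2)   # the 6 unordered index pairs
same  <- apply(pairs, 2, function(i) identical(list1[[i[1]]], list1[[i[2]]]))
sum(same)                             # number of identical pairs: 1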

4 Answers

Answer from ThomasIsCoding:

If you just want grouping tags, you can simply run match() + unique() from base R:

> match(list1, unique(list1))
[1] 1 2 3 3

Furthermore, to distinguish the groups, you can use split() on top of the output above, e.g.,

grp <- split(seq_along(list1), match(list1, unique(list1)))

then filter the group(s) having more than one entry, e.g.,

> unname(grp)[lengths(grp) > 1]
[[1]]
[1] 3 4
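
A hedged follow-up sketch, assuming the question's "number of identical data frames" means either the number of elements that have a duplicate or the number of extra copies:

tab <- table(match(list1, unique(list1)))
sum(tab[tab > 1])                       # elements that have at least one identical copy: 2
length(list1) - length(unique(list1))   # extra copies beyond the first occurrence: 1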

Answer from SamR:

Pairwise matrix of any duplicated elements

You can use outer() for this:

outer(list1, list1, Vectorize(identical))
#       [,1]  [,2]  [,3]  [,4]
# [1,]  TRUE FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE
# [3,] FALSE FALSE  TRUE  TRUE
# [4,] FALSE FALSE  TRUE  TRUE

Or to get the indices:

outer(list1, list1, Vectorize(identical)) |>
    `diag<-`(FALSE) |>
    which(arr.ind = TRUE)
#      row col
# [1,]   4   3
# [2,]   3   4
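
If a single count is needed, a short sketch that derives it from this matrix (assumption: each pair should only be counted once, so we look at the upper triangle):

m <- outer(list1, list1, Vectorize(identical))
sum(m[upper.tri(m)])   # identical pairs among the 6 comparisons: 1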

Indices of any duplicated elements

If you want to know the indices of any duplicated data frames (but not necessarily which elements they are identical to), you can create a small helper function using base R's duplicated(). Note that duplicated() flags an element only if it has already appeared earlier in the list, so the first occurrence of a repeated element is not marked as a duplicate. The helper therefore checks in both directions (the default and fromLast = TRUE), giving the indices of all elements that appear more than once.

which_duplicated <- function(l) {
    which(duplicated(l) | duplicated(l, fromLast = TRUE))
}

which_duplicated(list1)
# [1] 3 4

List of unique data frames

If you just want a list of the unique data frames you can do:

unique(list1)
# [[1]]
#   A B C  D
# 1 1 0 1  1
# 2 1 0 1 NA
# 3 0 1 1  0
# 4 1 1 0 NA

# [[2]]
#   A B C  D
# 1 1 0 1  1
# 2 1 0 1 NA
# 3 0 1 1 NA
# 4 1 1 0  0

# [[3]]
#   A B C  D
# 1 1 0 1 NA
# 2 1 0 1  1
# 3 0 1 1 NA
# 4 1 1 0  0

Answer from s_baldur:

library(digest) # We'll use hash strings instead of comparing data.frames

foo <- function(mylist) {
  grps <- split(seq_along(mylist), vapply(mylist, digest, character(1L)))
  list(number_unique = length(grps), identical = unname(grps)[lengths(grps)>1])
}

foo(list1)
# $number_unique
# [1] 3
# 
# $identical
# $identical[[1]]
# [1] 3 4

Benchmarking with 100 (small) data.frames

list2 <- lapply(1:100, \(i) iris[-sample(1:150, size = 1), ])
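
The benchmark below also calls th1() and th2(), which are not defined in this answer; a plausible reading, assuming they wrap the two steps of ThomasIsCoding's answer, would be:

# Hypothetical wrappers for the other answer's approaches (assumption, for reproducibility)
th1 <- function(mylist) match(mylist, unique(mylist))
th2 <- function(mylist) {
  grps <- split(seq_along(mylist), match(mylist, unique(mylist)))
  unname(grps)[lengths(grps) > 1]
}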

microbenchmark::microbenchmark(foo(list2), th1(list2), th2(list2))
# Unit: milliseconds
#        expr      min       lq       mean   median       uq      max neval
#  foo(list2)   2.6285   2.7871   5.983731   2.9289   3.4864  92.6672   100

Answer from jblood94:

rlang provides a fast hash, and collapse provides fast grouping.

library(collapse) # for `qG`
library(rlang) # for `hash`

list1 <- list(df1, df2, df3, df4)
qG(vapply(list1, hash, ""), sort = FALSE)
#> [1] 1 2 3 3
#> attr(,"N.groups")
#> [1] 3
#> attr(,"class")
#> [1] "qG"

We can see that there are three unique objects in list1, and that the third and fourth elements are identical (they both belong to group 3).
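
A small follow-up sketch for pulling the counts directly out of the qG result (qG stores the number of groups in the "N.groups" attribute):

g <- qG(vapply(list1, hash, ""), sort = FALSE)
attr(g, "N.groups")              # number of unique data frames: 3
which(g %in% g[duplicated(g)])   # indices of the identical elements: 3 4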

Benchmarking.

library(digest)

list2 <- replicate(1e3, matrix(sample(sample(4, 1)*3), 3), FALSE)

microbenchmark::microbenchmark(
  toString = qG(vapply(list2, toString, ""), sort = FALSE),
  digest = qG(vapply(list2, digest, ""), sort = FALSE),
  rlang = qG(vapply(list2, hash, ""), sort = FALSE),
  check = "identical"
)
#> Unit: milliseconds
#>      expr     min       lq      mean   median       uq     max neval
#>  toString  8.5588  8.87315  9.597491  9.07445  9.54395 16.9515   100
#>    digest 21.3525 24.49680 29.048084 27.07445 30.48985 88.1293   100
#>     rlang  2.8603  3.09435  5.618076  5.54240  6.53210 15.8621   100