Is n_distinct an exact calculation with disk frames?


I'm running n_distinct on a large file (>30GB) and it doesn't appear to produce an exact result.

I have an independent reference count for the same data, and the disk.frame aggregate does not match it.

The docs say that n_distinct is an exact calculation, not an estimate.

Is that right?

Best answer, by xiaodai:

The implementation of n_distinct can be found here: https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R

#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

So it looks to be an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, then runs n_distinct on the combined results of all chunks once they are collected.
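To illustrate why this two-stage logic should be exact, here is a minimal sketch in base R. It is not the disk.frame code itself: chunk_distinct and combine_distinct are hypothetical helper names, the chunking is simulated with split() on an in-memory vector, and length(unique(...)) stands in for dplyr's n_distinct.

```r
# Stage 1 (per chunk): keep only the distinct values, optionally dropping NA.
# Hypothetical stand-in for the chunk_agg step quoted above.
chunk_distinct <- function(x, na.rm = FALSE) {
  u <- unique(x)
  if (na.rm) u[!is.na(u)] else u
}

# Stage 2 (after collecting): count distinct values across all chunk results.
# length(unique(...)) is used here in place of dplyr::n_distinct.
combine_distinct <- function(chunk_results) {
  length(unique(unlist(chunk_results)))
}

set.seed(1)
x <- sample(c(1:50, NA), size = 1000, replace = TRUE)

# Simulated chunking (an assumption for illustration): 7 chunks, round-robin.
chunks <- split(x, rep_len(1:7, length(x)))

# Two-stage result versus a direct count on the full vector.
two_stage <- combine_distinct(lapply(chunks, chunk_distinct))
direct <- length(unique(x))
stopifnot(two_stage == direct)

# Same comparison with NA values removed.
two_stage_na_rm <- combine_distinct(lapply(chunks, chunk_distinct, na.rm = TRUE))
direct_na_rm <- length(unique(x[!is.na(x)]))
stopifnot(two_stage_na_rm == direct_na_rm)
```

Because every distinct value survives the per-chunk unique step, the final count does not depend on how the rows are split into chunks. A mismatch on real data would therefore point to something outside this logic.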

But I can't rule out a bug elsewhere.

Do you have a test case showing that the result is not exact? Perhaps you could contribute a PR with a test?