I'm running n_distinct on a large file (>30GB) and it doesn't appear to produce an exact result.
I have another reference point for the data, and the disk.frame aggregate does not match it.
The docs say that n_distinct is an exact calculation, not an estimate.
Is that right?
The implementation of n_distinct can be found on this page: https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R

It looks to be an exact calculation, as I intended. The logic is simple: it computes unique within each chunk, and then n_distinct on the result of all chunks once collected.

But I can't rule out a bug elsewhere.
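
To make the two-stage logic concrete, here is a minimal sketch in base R. It is not the disk.frame code itself: the data, the chunk count, and all variable names are made up for illustration, and `length(unique(...))` stands in for n_distinct. It only shows that taking unique per chunk and then counting distinct values over the combined result should agree with a single-pass count, whereas summing per-chunk counts would not.

```r
# Hypothetical data standing in for one column of a large file.
set.seed(1)
x <- sample(1:1000, size = 10000, replace = TRUE)

# Split into 8 chunks (an assumed number), mimicking chunked storage.
chunk_id <- rep_len(1:8, length(x))
chunks <- split(x, chunk_id)

# Stage 1: unique values within each chunk.
per_chunk_unique <- lapply(chunks, unique)

# Stage 2: distinct count over the collected per-chunk results.
two_stage <- length(unique(unlist(per_chunk_unique, use.names = FALSE)))

# Reference: distinct count over the whole vector at once.
reference <- length(unique(x))

# Summing per-chunk distinct counts over-counts any value that
# appears in more than one chunk, so it is not an exact method.
naive_sum <- sum(vapply(chunks, function(ch) length(unique(ch)), integer(1)))

stopifnot(two_stage == reference)
```

A comparison of this shape, run on a small sample of the real data, might help narrow down whether the mismatch comes from the aggregation step or from somewhere else, such as how the file was read in.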
Do you have test cases showing that the result is not exact? Perhaps you could contribute a PR with a test?