Count distinct in a rxSummary


I want to count the distinct values of var2, grouped by var1, in an .xdf file.

I tried something like this:

myFun <- function(dataList) {
    UniqueLevel <<- unique(c(UniqueLevel, dataList$var2))
    SumUniqueLevel <<- length(UniqueLevel)
    return(NULL)
}

rxSummary(formula = ~ var1,
          data = "DefModelo2.xdf",
          transformFunc = myFun,
          transformObjects = list(UniqueLevel = NULL),
          removeZeroCounts = FALSE)

Thank you in advance

EDIT:

Using RevoPemaR is probably the faster way.


Answer by Derek McCrae Norton (best answer):

Another option is to use rxCrossTabs. This gives you a cross-tabulation of the two factors, and you can count the nonzero entries in each row to get the number of distinct values of one factor within each level of the other.

censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
censusXtabAge <- rxCrossTabs(~ F(age):F(wkswork1), data = censusWorkers, 
                             removeZeroCounts = FALSE, returnXtabs = TRUE)
apply(censusXtabAge != 0, MARGIN = 1, sum)
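
To see why counting nonzero cells gives the distinct count, here is a minimal in-memory sketch that uses base R's table() in place of rxCrossTabs. The data frame `df` and its values are made up for illustration and stand in for the .xdf file.

```r
# Toy data standing in for the .xdf file (hypothetical values)
df <- data.frame(
  var1 = factor(c("a", "a", "a", "b", "b", "c")),
  var2 = factor(c("x", "y", "x", "z", "z", "x"))
)

# Cross-tabulate var1 (rows) against var2 (columns)
xtab <- table(df$var1, df$var2)

# A nonzero cell means that (var1, var2) pair occurs at least once,
# so the number of nonzero cells in a row is the distinct count of var2
distinct_by_var1 <- apply(xtab != 0, MARGIN = 1, sum)
distinct_by_var1
```

The same caveat applies as with rxCrossTabs: the table has one cell per pair of levels, so this approach suits factors with a moderate number of levels.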
Answer by Hong Ooi:

Split by var1, and then for each group, count up the unique values of var2. This assumes that var1 and var2 are factors; if they're not, you'll have to run rxFactors first.

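# Split the xdf into one file per level of var1
xdflst <- rxSplit(xdf, splitByFactor="var1", varsToKeep=c("var1", "var2"))

out <- rxExec(function(grp) {
        # Every row in a group has the same var1, so take it from the first row
        var1 <- head(grp, 1)$var1
        var2 <- rxDataStep(grp, varsToKeep="var2")$var2
        # One row per group: the group label and its distinct count
        data.frame(var1, distinct=length(unique(var2)))
    },
    grp=rxElemArg(xdflst))

# Combine the per-group results into a single data frame
do.call(rbind, out)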
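
For comparison, the same split-then-count logic can be sketched on an ordinary data frame with base R only. This is an illustration under assumed toy data (`df_small` is hypothetical), not a replacement for the xdf workflow, since it requires the data to fit in memory.

```r
# Toy in-memory data (hypothetical values)
df_small <- data.frame(
  var1 = c("a", "a", "a", "b", "b", "c"),
  var2 = c("x", "y", "x", "z", "z", "x"),
  stringsAsFactors = FALSE
)

# Split into one data frame per value of var1
groups <- split(df_small, df_small$var1)

# For each group, return one row: the group label and its distinct var2 count
per_group <- lapply(groups, function(grp) {
  data.frame(var1 = grp$var1[1], distinct = length(unique(grp$var2)))
})

# Stack the one-row results
result <- do.call(rbind, per_group)
result
```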

Or you could get my dplyrXdf package and use a dplyr group_by/summarise pipeline (which basically does all the above, including converting to factors if necessary):

xdf %>% group_by(var1) %>%
    summarise(distinct=n_distinct(var2),
              .rxArgs=list(varsToKeep=c("var1", "var2")))