Is there a good way to find out the length of entire data that one is working on in data.table, like with .N that returns the length of the group?
I give you an example:
library(data.table)
dt <- as.data.table(iris)
dt$Species[1:10] <- NA
dt[
!is.na(Species),
list(
proportion = .N / 140,
sepalwidthmean = mean(Sepal.Width)
),
by = Species
]
dt[
!is.na(Species),
list(
proportion = .N / sum(!is.na(dt$Species)),
sepalwidthmean = mean(Sepal.Width)
),
by = Species
]
As you can see, it takes quite some amount of typing to get the 140 (150-10) that is essentially redundant (already done in the I-argument of my query).
The way I did it feels a bit scary to me, because normally, data.table will look at the names of the columns instead of the names of the global scope of variables. So
dt$dt <- "dt"
dt[
!is.na(Species),
list(
proportion = .N / sum(!is.na(dt$Species)),
sepalwidthmean = mean(Sepal.Width),
lalala = mean(`dt$Species`)
),
by = Species
]
and you get Error in dt$Species : $ operator is invalid for atomic vectors. Is there an elegant way of doing it right?
It feels a bit bit like magic that normally, everything works out correctly:
dt$mean <- "mean"
dt[,mean(Sepal.Length)]
gives 5.843333 instead of an error. So maybe you can see why I want to have a better way than the one I am currently using.
You can add the
proportioncolumn in a second chained step.Edit: On a more attentive reading, the question is really about whether, given a
data.tablequery with aby(and potentially also ani) argument present, there is a neater way of getting the denominator in <group_N> / <overall_N> than re-referencing (and re-filtering) the table, in this case via e.g..N / sum(!is.na(dt$Species)). This is rather cumbersome when the data are filtered iniand a simple.N / nrow(dt)won't work.The answer above, i.e.
.Nfor <group_N> thensum(N)for <overall_N> in a second step, is what I use for this very common pattern (and I doubt I'm alone), but doesn't really answer this point.As @M-- points out in a comment, there is an open issue here regarding the lack of a <overall_N> equivalent of
.N(i.e. one that ignoresby) and a potentially breaking overhaul of the.N,.Iand other symbols is a milestone for data.table 2.0.0.In the meantime it seems that the method mentioned in the question, plus the two-step chained alternative added above, are the closest we have to a canonical way of achieving this.