Is there a good way to get the entire length of the i - argument in a data.table?

93 Views Asked by At

Is there a good way to find out the length of entire data that one is working on in data.table, like with .N that returns the length of the group?

I give you an example:

library(data.table)
dt <- as.data.table(iris)
dt$Species[1:10] <- NA
dt[
  !is.na(Species),
  list(
    proportion = .N / 140,
    sepalwidthmean = mean(Sepal.Width)
  ),
  by = Species
]
dt[
  !is.na(Species),
  list(
    proportion = .N / sum(!is.na(dt$Species)),
    sepalwidthmean = mean(Sepal.Width)
  ),
  by = Species
]

As you can see, it takes quite some amount of typing to get the 140 (150-10) that is essentially redundant (already done in the I-argument of my query).

The way I did it feels a bit scary to me, because normally, data.table will look at the names of the columns instead of the names of the global scope of variables. So

dt$dt <- "dt"
dt[
  !is.na(Species),
  list(
    proportion = .N / sum(!is.na(dt$Species)),
    sepalwidthmean = mean(Sepal.Width),
    lalala = mean(`dt$Species`)
  ),
  by = Species
]

and you get Error in dt$Species : $ operator is invalid for atomic vectors. Is there an elegant way of doing it right?

It feels a bit bit like magic that normally, everything works out correctly:

dt$mean <- "mean"
dt[,mean(Sepal.Length)]

gives 5.843333 instead of an error. So maybe you can see why I want to have a better way than the one I am currently using.

1

There are 1 best solutions below

2
Tobo On BEST ANSWER

You can add the proportion column in a second chained step.

dt[!is.na(Species), .(sepalwidthmean = mean(Sepal.Width), .N), by=Species
 ][, proportion := N/sum(N)][]

      Species sepalwidthmean     N proportion
       <fctr>          <num> <int>      <num>
1:     setosa         3.4575    40  0.2857143
2: versicolor         2.7700    50  0.3571429
3:  virginica         2.9740    50  0.3571429

Edit: On a more attentive reading, the question is really about whether, given a data.table query with a by (and potentially also an i) argument present, there is a neater way of getting the denominator in <group_N> / <overall_N> than re-referencing (and re-filtering) the table, in this case via e.g. .N / sum(!is.na(dt$Species)). This is rather cumbersome when the data are filtered in i and a simple .N / nrow(dt) won't work.

The answer above, i.e. .N for <group_N> then sum(N) for <overall_N> in a second step, is what I use for this very common pattern (and I doubt I'm alone), but doesn't really answer this point.

As @M-- points out in a comment, there is an open issue here regarding the lack of a <overall_N> equivalent of .N (i.e. one that ignores by) and a potentially breaking overhaul of the .N, .I and other symbols is a milestone for data.table 2.0.0.

In the meantime it seems that the method mentioned in the question, plus the two-step chained alternative added above, are the closest we have to a canonical way of achieving this.