Subsetting a long-data.table using values of a column within the data.table and casting the other values


I have a data.table with 22 million observation rows of the following form:

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

I would like to generate the following output using data.table:

output_dt <- data.table(
  firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
  AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
  value = c(10, 3, 11, 5, 25, 10, 23, 7)
)

I initially tried the following:

dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)

but I get the following error:

Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I also tried

dcast.data.table(dt[, N := 1:.N, metric], firm_id ~ metric, subset = (metric == "AN_BILANT"))

Here I get the warning

Aggregate function missing, defaulting to 'length'
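Checking the data confirms what both messages are pointing at: every firm_id/metric combination matches more than one row, so a cast to firm_id ~ metric cannot fill a single cell without aggregating. A quick check (reusing dt from above):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Count rows per firm_id/metric pair: each pair occurs twice, so dcast
# would have to squeeze two values into one cell of the wide table
counts <- dt[, .N, by = .(firm_id, metric)]
print(counts)
```

Every pair has N = 2, which is why a non-aggregating fun.aggregate errors out and the default length kicks in.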


There are 2 best solutions below

akrun

We may create the 'AN_BILANT' column by assigning (:=) the first 'value' after grouping by the cumulative sum of a logical vector, and then remove the 'AN_BILANT' rows

library(data.table)
dt[, AN_BILANT :=  value[1], cumsum(metric == 'AN_BILANT')][
    metric != 'AN_BILANT']

Output:

    firm_id metric value AN_BILANT
1:       1   OPEX    10      2013
2:       1  CAPEX     3      2013
3:       1   OPEX    11      2014
4:       1  CAPEX     5      2014
5:       2   OPEX    25      2007
6:       2  CAPEX    10      2007
7:       2   OPEX    23      2009
8:       2  CAPEX     7      2009
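To see why the grouping works, note that cumsum(metric == 'AN_BILANT') starts a new group at every 'AN_BILANT' row, so each year row and the metric rows that follow it share one group id (this does assume the rows arrive in that order):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# The logical vector is TRUE exactly at the 'AN_BILANT' rows, so the
# cumulative sum increments there and stays flat in between
grp <- cumsum(dt$metric == "AN_BILANT")
print(grp)
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
```

Within each group, value[1] is the year, which is what gets broadcast into the new AN_BILANT column.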
r2evans

I like akrun's approach, but if the data forces you to choose an alternative (in case cumsum is too sensitive to ordering of data), you can try a dcast/melt approach like below. Note that since firm_id does not contain just one of each metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.

library(data.table)
dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
      firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
#    firm_id AN_BILANT metric value
#      <num>     <num> <fctr> <num>
# 1:       1      2013  CAPEX     3
# 2:       1      2014  CAPEX     5
# 3:       2      2007  CAPEX    10
# 4:       2      2009  CAPEX     7
# 5:       1      2013   OPEX    10
# 6:       1      2014   OPEX    11
# 7:       2      2007   OPEX    25
# 8:       2      2009   OPEX    23

Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
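If the firm/year grouping of the rows matters, one option is to reorder after the double reshape with setorder. A sketch, reusing the dcast/melt chain above and storing the result in res (a name introduced here; within a firm/year group the metric order still follows the melt, CAPEX before OPEX):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric  = rep(c("AN_BILANT", "OPEX", "CAPEX"), times = 4),
  value   = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# Widen with a per-group counter, drop it, then melt back to long form
res <- dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
             firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")

# Reorder in place so rows are grouped by firm and year again
setorder(res, firm_id, AN_BILANT)
print(res)
```

setorder uses a stable sort and modifies res by reference, so no copy of the 22-million-row table is made.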