Apache Druid GroupBy Virtual columns


I am trying to group by a virtual column in a Druid native query, which looks like this:

{
  "queryType": "groupBy",
  "dataSource": "trace_info",
  "granularity": "none",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')"
    }
  ],
  "dimensions": [
    "tenant"
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "trc",
      "fieldName": "rc"
    }
  ],

...
...
...

  "intervals": [
    "..."
  ]
}

This returns a single row containing the longSum of all row_count values, as if the groupBy column were null.

Is my usage correct, or is this a known issue in Druid? The documentation says virtual columns can be used like normal dimensions, but it is not very clear on how, and a working example is missing.
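For readers unfamiliar with the expression functions, here is a rough Python emulation of what the two virtual columns appear to compute for a single row. It is only a sketch: the sample row, the helper names and the handling of a missing tag (returning None) are assumptions for illustration, not Druid's actual implementation.

```python
from typing import List, Optional


def array_offset_of(arr: List[str], value: str) -> Optional[int]:
    """0-based index of the first match, or None when absent (assumed behaviour)."""
    return arr.index(value) if value in arr else None


def array_offset(arr: List[str], index: Optional[int]) -> Optional[str]:
    """Element at a 0-based index, or None when the index is missing or out of range."""
    if index is None or not 0 <= index < len(arr):
        return None
    return arr[index]


def tag_value(tags: List[str], tag_names: List[str], key: str) -> Optional[str]:
    """Mimics replace(array_offset(tags, array_offset_of(tagNames, key)), key + ':', '')."""
    raw = array_offset(tags, array_offset_of(tag_names, key))
    return None if raw is None else raw.replace(f"{key}:", "")


# Hypothetical row; the real contents of trace_info are not shown in the question.
row = {"tagNames": ["tenant", "row_count"], "tags": ["tenant:acme", "row_count:42"]}

tenant = tag_value(row["tags"], row["tagNames"], "tenant")
rc = tag_value(row["tags"], row["tagNames"], "row_count")
print(tenant, rc)
```

Note that in this emulation both results are strings, including the row count, so the type that each virtual column is read as plausibly matters for grouping and summing.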

Thanks! Phani

PhaKuDi On

Latest Edit...

After some more digging, I found that the issue was the missing "outputType" attributes on the virtual columns. This is strange, because the aggregator was able to auto-detect the type and calculate the long sum properly even though the group-by results were wrong.

  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')",
      "outputType": "LONG"
    }
  ],
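To see the fix in the context of the whole query, the sketch below assembles the groupBy request in Python with "outputType" set on both virtual columns. The helper names (tag_virtual_column, build_group_by_query) and the sample interval are hypothetical, and the fields elided from the original query are simply left out here.

```python
import json
from typing import Any, Dict, List


def tag_virtual_column(name: str, tag: str, output_type: str) -> Dict[str, Any]:
    """Build one expression virtual column that strips the '<tag>:' prefix."""
    expression = (
        f"replace(array_offset(tags, array_offset_of(tagNames, '{tag}')), '{tag}:', '')"
    )
    return {
        "type": "expression",
        "name": name,
        "expression": expression,
        "outputType": output_type,  # the attribute that was missing in the question
    }


def build_group_by_query(intervals: List[str]) -> Dict[str, Any]:
    """Assemble the groupBy query; fields elided in the original post are omitted."""
    return {
        "queryType": "groupBy",
        "dataSource": "trace_info",
        "granularity": "none",
        "virtualColumns": [
            tag_virtual_column("tenant", "tenant", "STRING"),
            tag_virtual_column("rc", "row_count", "LONG"),
        ],
        "dimensions": ["tenant"],
        "aggregations": [{"type": "longSum", "name": "trc", "fieldName": "rc"}],
        "intervals": intervals,
    }


# Placeholder interval for illustration only; the original interval is not shown.
query = build_group_by_query(["2020-01-01/2020-01-02"])
print(json.dumps(query, indent=2))
```

The printed JSON is the body that would be submitted as a native query; how it is sent is outside the scope of this sketch.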

Use the fix above. The original workaround below is likely a non-performant way of getting around the problem.

After some trial and error I have a workaround for this using an extraction dimension. Although I am not sure, I suspect that this is a temporary issue in Druid 0.18.1. Hopefully grouping on virtual columns will work as advertised in future builds.

{
  "queryType": "groupBy",
  "dataSource": "trace_info",
  "granularity": "none",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')"
    }
  ],
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "tenant",
      "outputName": "t",
      "extractionFn": {
        "type" : "substring", "index" : 1
      }
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "trc",
      "fieldName": "rc"
    }
  ],

...
...
...

  "intervals": [
    "..."
  ]
}