Aggregate values into a new column while retaining the old column

Question

Say you have some simple data about some purchases:

user_id order_date product_id
001 mon 2e1
001 mon 44h
001 tues e6f
002 wed 6g3
002 wed 43m
003 wed k19
003 fri 9d5

I need to aggregate the product IDs into an array column, e.g. using COLLECT_SET, grouping by user_id and order_date. However, I also want to retain the product_id column, like so:

user_id order_date product_id product_ids
001 mon 2e1 ["2e1","44h"]
001 mon 44h ["2e1","44h"]
001 tues e6f ["e6f"]
002 wed 6g3 ["6g3","43m"]
002 wed 43m ["6g3","43m"]
003 wed k19 ["k19"]
003 fri 9d5 ["9d5"]

Problem

I can easily create the array column with the following query:

SELECT user_id, 
       order_date, 
       COLLECT_SET(product_id) AS product_ids
FROM table t
GROUP BY user_id, order_date

But that way I don't get the product_id column for every row, which I need.

Meanwhile if I include the product_id as so:

SELECT user_id,
       order_date, 
       product_id, 
       COLLECT_SET(product_id) AS product_ids
FROM table t
GROUP BY user_id, order_date, product_id

Then the product_ids column will always be an array of length one, i.e.:

user_id order_date product_id product_ids
001 mon 2e1 ["2e1"]
001 mon 44h ["44h"]

And of course, if I exclude product_id from the GROUP BY, I get an error: "Expression not in GROUP BY key 'product_id'"

Is it possible to do this in a single simple query, without e.g. creating a temp table and then joining them on user_id and order_date? Thanks!

There is 1 best solution below

Zero On

The second query doesn't give the result you want because it groups by every selected column, including product_id. Each group then contains a single product_id, so COLLECT_SET returns a one-element array and the output has the same rows as the original table.

Instead, aggregate the table on user_id and order_date to build an aggregated dataset, then join the main table to that dataset on those two columns. That gives the expected result:

SELECT
    t1.user_id,
    t1.order_date, 
    t1.product_id, 
    t2.product_ids
FROM 
    table t1
LEFT JOIN (
    SELECT 
        user_id, 
        order_date, 
        COLLECT_SET(product_id) AS product_ids
    FROM 
        table t
    GROUP BY 
        user_id, order_date
) AS t2
    ON t1.user_id = t2.user_id
    AND t1.order_date = t2.order_date

The query above does exactly that: the main table (t1) is left-joined to the aggregated subquery (t2), and the outer SELECT takes the COLLECT_SET result from t2 as the product_ids column.
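If you want to check the aggregate-then-join logic without a Hive cluster, here is a minimal sketch using Python's standard-library sqlite3 module and the sample rows from the question. It rests on a few assumptions that are not part of the original query: the table is called purchases (a hypothetical name), SQLite's GROUP_CONCAT(DISTINCT ...) stands in for COLLECT_SET, and the comma-joined string is split back into a list on the Python side.

```python
import sqlite3

# Sample rows from the question: (user_id, order_date, product_id).
ROWS = [
    ("001", "mon", "2e1"),
    ("001", "mon", "44h"),
    ("001", "tues", "e6f"),
    ("002", "wed", "6g3"),
    ("002", "wed", "43m"),
    ("003", "wed", "k19"),
    ("003", "fri", "9d5"),
]

# Same shape as the Hive query: aggregate per (user_id, order_date),
# then join the aggregate back onto every original row.
# Assumption: GROUP_CONCAT(DISTINCT ...) is only a SQLite stand-in
# for Hive's COLLECT_SET; "purchases" is a hypothetical table name.
QUERY = """
SELECT t1.user_id, t1.order_date, t1.product_id, t2.product_ids
FROM purchases t1
LEFT JOIN (
    SELECT user_id, order_date,
           GROUP_CONCAT(DISTINCT product_id) AS product_ids
    FROM purchases
    GROUP BY user_id, order_date
) AS t2
    ON t1.user_id = t2.user_id
    AND t1.order_date = t2.order_date
"""


def run_join_back(rows):
    """Return (user_id, order_date, product_id, product_ids) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE purchases "
            "(user_id TEXT, order_date TEXT, product_id TEXT)"
        )
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
        result = conn.execute(QUERY).fetchall()
    finally:
        conn.close()
    # Split the comma-joined string and sort it, because the element
    # order produced by the aggregate is not guaranteed.
    return [(u, d, p, sorted(ids.split(","))) for u, d, p, ids in result]


if __name__ == "__main__":
    for row in run_join_back(ROWS):
        print(row)
```

Every input row comes back once, each carrying the full set of product IDs for its user_id and order_date pair.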

The subquery would return the following dataset:

user_id order_date product_ids
001 mon ["2e1","44h"]
001 tues ["e6f"]
002 wed ["6g3","43m"]
003 wed ["k19"]
003 fri ["9d5"]

Then the overall query's result would be:

user_id order_date product_id product_ids
001 mon 2e1 ["2e1","44h"]
001 mon 44h ["2e1","44h"]
001 tues e6f ["e6f"]
002 wed 6g3 ["6g3","43m"]
002 wed 43m ["6g3","43m"]
003 wed k19 ["k19"]
003 fri 9d5 ["9d5"]
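
Since the question asks for a way to avoid the join, a window function may be worth trying as an alternative. This is not part of the answer above and is an assumption to verify on your Hive version: the Hive form would be COLLECT_SET(product_id) OVER (PARTITION BY user_id, order_date). The sketch below only demonstrates the idea in SQLite through Python's sqlite3 module, with GROUP_CONCAT as a stand-in for COLLECT_SET and a hypothetical table name purchases.

```python
import sqlite3

# Sample rows from the question: (user_id, order_date, product_id).
ROWS = [
    ("001", "mon", "2e1"),
    ("001", "mon", "44h"),
    ("001", "tues", "e6f"),
    ("002", "wed", "6g3"),
    ("002", "wed", "43m"),
    ("003", "wed", "k19"),
    ("003", "fri", "9d5"),
]

# A window aggregate keeps every input row and attaches the aggregate
# of its partition, so no GROUP BY and no self-join are needed.
# Assumption: GROUP_CONCAT ... OVER is a SQLite stand-in for
# COLLECT_SET ... OVER in Hive. Requires SQLite 3.25 or newer.
QUERY = """
SELECT user_id, order_date, product_id,
       GROUP_CONCAT(product_id) OVER (
           PARTITION BY user_id, order_date
       ) AS product_ids
FROM purchases
"""


def run_windowed(rows):
    """Return (user_id, order_date, product_id, product_ids) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE purchases "
            "(user_id TEXT, order_date TEXT, product_id TEXT)"
        )
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
        result = conn.execute(QUERY).fetchall()
    finally:
        conn.close()
    # set() drops duplicates, since the window form of GROUP_CONCAT
    # keeps them; sorting makes the output order deterministic.
    return [(u, d, p, sorted(set(ids.split(",")))) for u, d, p, ids in result]


if __name__ == "__main__":
    for row in run_windowed(ROWS):
        print(row)
```

The design difference is that the partition plays the role of the GROUP BY keys while product_id stays available on each row, which is the shape of output the question asks for.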