I have an enormous clickstream data which I can pull the timestamp, user_ids corresponding item_ids that customers clicked.
I am trying to create a dataset that will have the unique customer ids (user_id) on each row and another column that has key:value pair that provides me the list of items and number of times corresponding item is clicked like a python dictionary (or this can be in two columns with items and corresponding num. clicks as a sorted list from most frequent to least). The number of clicks for an item is the times that item_id occur for a fixed user_id. Basically, every occurrence of item is one click. If it occurs 10 times for a fixed user_id, that means that click_count = 10 for that user_id, item_id pair.
Here is an example:
| USER_ID | CLICKED_ITEMS |
|---|---|
| user1 | item3: 20, item2: 10, item1: 5 |
| user2 | item1: 10, item2: 1, item3: 1 |
| user3 | item2: 1 |
| user4 | item1: 10, item2: 2 |
or
| USER_ID | CLICKED_ITEMS | NUM_CLICKS |
|---|---|---|
| user1 | item3, item2, item1 | 20, 10, 5 |
| user2 | item1, item2, item3 | 10, 1, 1 |
| user3 | item2 | 1 |
| user4 | item1, item2 | 0, 2 |
Is this possible using SQL (esp. for Snowflake syntax)? TIA
I tried
SELECT user_id, using ARRAY_UNIQUE_AGG(item_id) AS clicked_items
FROM clickstream_table
GROUP_BY user_id
and got unique user_ids and corresponding list of items but not sure how to find number of times each item is clicked. I can do this easily using pandas but unfortunately I cannot even pull 1 month of data due to size and timeout limitations (3hr limit) on warehouse.
Assuming you know how to aggregate your data to be like the following:
the results you are after can be done with ARRAY_AGG and OBJECT_CONSTRUCT
gives:
or if you want it to look "exactly" like you have shown, then the second option can be done with two LISTAGG
giving:
well, the first output can be created with string concatenations also.
gives: