Dynamically lateral flatten and pivot in Snowflake / Snowpark to columns

170 Views Asked by At

I am looking to flatten a column named 'EVENT_PARAMS_JSON' in snowflake having json values.

select EVENT_PARAMS_JSON from GA4_EVENT_DETAILS limit 1;

results in below output.

[
    {
    "key": "firebase_screen",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": null, "string_value": "rewards"
        }
    },
    {
    "key": "firebase_previous_screen",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": null, "string_value": "fuelpay_dashboard_offsite"
        }
    },
    {
    "key": "firebase_screen_class",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": null, "string_value": "View"
        }
    },
    {
    "key": "firebase_previous_class",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": null, "string_value": "View"
        }
    },
    {
    "key": "firebase_screen_id",
    "value":
        {
        "double_value": null, "float_value": null, int_value": "4511208285398238808", string_value": null
        }
    },
    {
    "key": "engaged_session_event",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": 1, "string_value": null
        }
    },
    {
    "key": "theme",
    "value":
        {
        "double_value": null, "float_value": null, "int_value": null, "string_value": "lightMode"
        }
    }
]

I would like my output table to have keys as column names and value as rows. Kindly note each key has multiple values, and the right value is the one which is not blank/null.

enter image description here

I tried this in snowpark but got an error :

100357 (P0000): Python Interpreter Error:

raise error_class(
snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01b0d218-0001-087c-0000-24952743d6f2: 001007 (22023): SQL compilation error:
invalid type [VARCHAR(16777216)] for parameter '1'
import snowflake.snowpark as snowpark

def main(session: snowpark.Session): 
    # Your code goes here, inside the "main" handler.
    tableName = 'GA4_EVENT_DETAILS'
    df = session.table(tableName).filter((col("app_info_id") == 'au.com.caltex.flagship')&   (col("event_name") == 'screen_view')& (col("event_date") == '20230602')&(col("EVENT_PARAMS_FIREBASE_SCREEN_CLASS")=='View'))

    # Print a sample of the dataframe to standard output.
    df.show()
    df = df.join_table_function("flatten", df["EVENT_PARAMS_JSON"]).drop(["SEQ", "PATH", "INDEX", "THIS"])
1

There are 1 best solutions below

2
Simeon Pilgrim On

to show this is mostly a SQL problem, lets get a CTE with a raw input string being the same as you note, and then PARSE_JSON it, like Mat suggests, and see this is the first level problem:

so here is a CTE that has the string, and JSON as VARAINT (vis PARSE_JSON)

with data as (
    select
        '[{"key":"firebase_screen","value":{"string_value":"rewards","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_previous_screen","value":{"string_value":"fuelpay_dashboard_offsite","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_screen_class","value":{"string_value":"View","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_previous_class","value":{"string_value":"View","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_screen_id","value":{"string_value":null,"int_value":"4511208285398238808","float_value":null,"double_value":null}},{"key":"engaged_session_event","value":{"string_value":null,"int_value":1,"float_value":null,"double_value":null}},{"key":"theme","value":{"string_value":"lightMode","int_value":null,"float_value":null,"double_value":null}}]' as raw_string,
    parse_json(raw_string) as json
)

now if we try flatten the raw string text:

select j.*
from data as d,
lateral flatten(input=>d.raw_string, mode=>'array') as j;

invalid type [VARCHAR(860)] for parameter 'input'

invalid type [VARCHAR(860)] for parameter 'input'

which matches your error:

so instead using the VARAINT type, via PARSE_JSON which is the expected input to FLATTEN:

select j.*
from data as d,
lateral flatten(input=>d.json, mode=>'array') as j;

success:

enter image description here

so now we have data we can work with, if we pull the key and value from value:

select j.value:key
    ,j.value:value
from data as d,
lateral flatten(input=>d.json, mode=>'array') as j;

we get:

enter image description here

so now we want to pull the four values types out:

select j.value:key
    ,j.value:value:double_value
    ,j.value:value:float_value
    ,j.value:value:int_value
    ,j.value:value:string_value
from data as d,
lateral flatten(input=>d.json, mode=>'array') as j;

enter image description here

so now we want to COALESCE or NVL those values to keep the single non-null value:

select j.value:key
    ,COALESCE(
        j.value:value:double_value::text, 
        j.value:value:float_value::text,
        j.value:value:int_value::text, 
        j.value:value:string_value::text) as val
from data as d,
lateral flatten(input=>d.json, mode=>'array') as j;

gives:

enter image description here

so this is all great, now as you state, you want the row keys, to be the column headers, this is done via PIVOT where you have to know the names of the keys/column, via a manual pivot, where you have to know the names of the keys, or dynamically, where now your results can change randomly depending on the data..

I will show the manual pivot method, because it's the one I like.

with data as (
    select
        '[{"key":"firebase_screen","value":{"string_value":"rewards","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_previous_screen","value":{"string_value":"fuelpay_dashboard_offsite","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_screen_class","value":{"string_value":"View","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_previous_class","value":{"string_value":"View","int_value":null,"float_value":null,"double_value":null}},{"key":"firebase_screen_id","value":{"string_value":null,"int_value":"4511208285398238808","float_value":null,"double_value":null}},{"key":"engaged_session_event","value":{"string_value":null,"int_value":1,"float_value":null,"double_value":null}},{"key":"theme","value":{"string_value":"lightMode","int_value":null,"float_value":null,"double_value":null}}]' as raw_string,
    parse_json(raw_string) as json
), data_as_rows as (
    select j.seq, 
        j.value:key::text as col_key
        ,COALESCE(
            j.value:value:double_value::text, 
            j.value:value:float_value::text,
            j.value:value:int_value::text, 
            j.value:value:string_value::text) as val
    from data as d,
    lateral flatten(input=>d.json, mode=>'array') as j
)
select 
    max(iff(col_key='firebase_screen', val, null)) as firebase_screen
    ,max(iff(col_key='firebase_previous_screen', val, null)) as firebase_previous_screen
    ,max(iff(col_key='firebase_screen_class', val, null)) as firebase_screen_class
    ,max(iff(col_key='firebase_previous_class', val, null)) as firebase_previous_class
    ,max(iff(col_key='firebase_screen_id', val, null)) as firebase_screen_id
    ,max(iff(col_key='engaged_session_event', val, null)) as engaged_session_event
    ,max(iff(col_key='theme', val, null)) as theme
from data_as_rows
group by seq

gives:

enter image description here

The selecting of the SEQ from the FLATTEN allows stitching related rows back together, and thus why it is used in the GROUP BY.

Now you will need to rebuild this SQL into your python instructions to get snowflake to turn it back into the equivalent SQL, so you results "just work".