Bigquery aggregate column array data

132 Views Asked by At

Considering a BigQuery view that has one column ARRAY<STRUCT<col1 FLOAT, col2 FLOAT>> called X.

Which is the best way to retrieve those rows with an additional column that is a computation of those elements in the "X" array. Is it possible to make it via big query stored procedure?

it would be great having something like:

select computation(X), * from something

computation(X) would loop through the various element of the X array and sum them with some additional rules. Regarding that, it seems BigQuery functions do not handle structures/arrays but only scalar types so X can't be taken in account due to the fact that datatype is ARRAY<STRUCT<col1 FLOAT, col2 FLOAT>>. Loops are still needed so I was thinking about using a procedure but I couldn't find any example that does operations on repeatable columns and give the result with the query itself.

One of the point was keeping this query usable within BigQuery query console, avoiding solutions like external script (like python).

Is there any example of procedures that internally have a select and they return additional information to it? Like a map function over a query. A sort of filter like the previous function computation (X) example.

As requested, to give more context I have a column that is:

ARRAY<STRUCT<pricing_unit_quantity FLOAT64,
start_usage_amount FLOAT64, usd_amount FLOAT64, account_currency_amount FLOAT64>>

It contains GCP price tiers. I have to iterate through all of them and compute the final price. FLOAT64 is a placeholder for a reliable datatype for currencies. I am still looking for that on BigQuery.

I would reach something like that, implementing a function called get_tiers_total_expense


-- Ex. Those are 2 tiers.
-- Tier 1: It starts from 0 usages and goes until 20. The costs 10 for every unit used.
-- Tier 2: It starts from 20 usages and cost is 5

select get_tiers_total_expense(array(
  select as struct 1.0, 0.0, 10.0, 9.0 union all
  select as struct 1.0, 20.0, 5.0, 4.0 as tiered_rates));
2

There are 2 best solutions below

2
Jaytiger On BEST ANSWER

You might consider below. (I would suggest SQL UDF rather than JS UDF.)

-- sample data
CREATE TEMP TABLE tiers AS
SELECT 1.0 pricing_unit_quantity, 0.0 start_usage_amount,
       10.0 usd_amount, 9.0 account_currency_amount UNION ALL
SELECT 1.0, 20.0, 5.0, 4.0;

-- define UDFs here
CREATE TEMP FUNCTION get_tiers_total_expense (
  tiers ARRAY<STRUCT<pricing_unit_quantity FLOAT64,
                     start_usage_amount FLOAT64,
                     usd_amount FLOAT64,
                     account_currency_amount FLOAT64>>
) AS ((
  -- it takes tiers and return the final cost
  -- -> you can adjust the aggregation logic
  SELECT SUM(pricing_unit_quantity * usd_amount * account_currency_amount)
    FROM UNNEST(tiers) tier
));

-- sample query here using UDFs and sample data
SELECT get_tiers_total_expense(
         ARRAY_AGG(
           STRUCT(
             pricing_unit_quantity,
             start_usage_amount,
             usd_amount,
             account_currency_amount
           )
         )
       ) AS get_tiers_total_expense
  FROM tiers t
 --GROUP BY -- later you can change group-by columns depending on your use cases.

-- query result
+-------------------------+
| get_tiers_total_expense |
+-------------------------+
|                   110.0 |
+-------------------------+
0
Yak O'Poe On

I ended up by following the suggestion of @jaytiger and so I created a BigQuery UDF.

I had to iterate through all the gcp tiers to calculate the final cost cost. I used FLOAT as placeholder for currencies. I surely need to figure out which is a better data type for storing them in BigQuery...

Here the partial solution:

CREATE OR REPLACE FUNCTION get_tiers_total_expense(
tiers ARRAY<STRUCT<pricing_unit_quantity FLOAT64,
start_usage_amount FLOAT64, usd_amount FLOAT64, account_currency_amount FLOAT64>>
)
RETURNS NUMBER
LANGUAGE js AS """
// it takes tiers and return the final cost
// Luckily float64 are converted to Number in JS and not float
// that is not great for currency computation
""";


-- Ex. Those are 2 tiers.
-- Tier 1: It starts from 0 usages and goes until 20. The costs 10 for every unit used.
-- Tier 2: It starts from 20 usages and cost is 5

select get_tiers_total_expense(array(
  select as struct 1.0, 0.0, 10.0, 9.0 union all
  select as struct 1.0, 20.0, 5.0, 4.0 as tiered_rates));