Is it correct to add "UNNEST" in the "ON" condition of a (left) join?


Suppose I have a BigQuery table "table1" containing a column "field1" and a table "table2" containing an array column "field2".

I want to join each row of table1 to each row of table2 for which the value of "field1" appears in the array "field2".

I tried the query below, but it takes too long in BigQuery. In fact, it never finished for the cases I tested, so I think something is wrong with it.

Is there a better way to achieve this?

Tested query:

SELECT *
FROM table1
LEFT JOIN table2 ON table1.field1 IN UNNEST(table2.field2)

Result:

Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.


There is 1 solution below

Samuel On

The join may produce far more rows than expected. It could also be that table2 is much larger than table1, in which case a right join could be more efficient. First, we determine the final row count of the joined table.

WITH table1 AS (
  # sample data: field1 = 'a1' .. 'a1000'
  SELECT 'a' || x AS field1
  FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS x
),
table2 AS (
  # sample data: 1000 rows, each with an array of 1001 strings;
  # only y = x and y IN (5, 6, 7, 8) give values of the form 'a<y>' that match field1
  SELECT x,
    (SELECT ARRAY_AGG(IF(y = x OR y IN (5, 6, 7, 8), '', y || '') || 'a' || y)
     FROM UNNEST(GENERATE_ARRAY(0, 1000)) AS y) AS field2
  FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS x
),
helper AS (SELECT DISTINCT field1 FROM table1),
test AS (
  SELECT #*,
    (SELECT STRUCT(COUNT(x) AS counts, ARRAY_AGG(x) AS data)
     FROM UNNEST(field2) AS x
     INNER JOIN helper ON field1 = x).*
  FROM table2
)
# gather statistics
SELECT COUNT(1) AS row_counts_table2, SUM(counts) AS row_counts_join, MIN(counts) AS min, MAX(counts) AS max
FROM test

# solution with right join
# SELECT * FROM (SELECT * FROM test, UNNEST(data) AS data2join) RIGHT JOIN table1 ON field1 = data2join
  1. Two temporary tables (table1 and table2) are created as CTEs to hold the example data. The condition y in (...) ensures that the array field2 contains several matches for field1.
  2. A CTE named helper stores the distinct values of field1 from table1, because each value is needed only once.
  3. The CTE test uses a subquery to perform an inner join between the elements of field2 and field1. The struct keeps both the match count and the matching entries.
  4. The counts field gives statistics such as the final row count of the joined table. Check whether a join of that size is feasible.
  5. Unnesting the data array from the CTE test yields all possible join rows from table2. The right join with table1 keeps all entries from table1 and adds the unnested ones.
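
The same idea can also be written directly against the original tables. The following is only a minimal sketch, assuming table1 and table2 look as described in the question; the names flat and element are made up for illustration. It unnests the array first, so the join condition becomes a plain equality instead of an IN UNNEST(...) test, which is likely to be evaluated for every pair of rows.

```sql
# Sketch only: assumes table1(field1) and table2(field2 ARRAY) as in the question.
# The names flat and element are illustrative and do not appear in the original query.
WITH flat AS (
  # one row per (table2 row, array element)
  SELECT t2.*, element
  FROM table2 AS t2
  CROSS JOIN UNNEST(t2.field2) AS element
)
SELECT *
FROM table1 AS t1
LEFT JOIN flat
  ON t1.field1 = flat.element  # equality condition instead of IN UNNEST(...)
```

Note one difference from the original query: if the same value occurs more than once in a single array, this version returns one joined row per occurrence. Whether the query finishes in reasonable time still depends on the join size estimated in step 4.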