Why does the Redshift query plan show a join between tables not joined in the query?


I'm facing a query performance issue on a fairly complex query in Redshift, and as a Redshift noob I don't understand it. The query is made of several joins (INNER and LEFT) and doesn't return data even after hours.

I commented/uncommented each join to understand where the bottleneck was coming from. Once I found the point where query performance exploded, I took a look at the query plan and noticed that Redshift was performing a join between two tables (around 1 million records and 15 thousand records respectively) that are not directly joined in the query. For this join, Redshift was only using the distribution key, which in this case generates a join close to a Cartesian product. Monitoring the query execution showed that the "unwanted" join was creating an internal working table of several billion (with a b) records.

I'd like to understand why the optimizer performs this join, and how I can prevent it.
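One workaround I'm considering is to materialize the external tables into local temporary tables first, so the planner works with real row counts instead of guesses. A sketch, where `join_key` and `other_cols` are placeholders for my actual columns:

```sql
-- Pull the external (Spectrum) data into a local temp table first,
-- applying the same filter the big query uses.
CREATE TEMP TABLE tmp_table1 AS
SELECT join_key, condition1, other_cols
FROM schemaname.table1
WHERE condition1 = 'A';

-- Refresh statistics so the optimizer sees the real cardinality.
ANALYZE tmp_table1;

-- Then rewrite the big query against tmp_table1 instead of the
-- external table, keeping the original join conditions.
```

Would this be a reasonable way to stop the optimizer from injecting the distribution-key-only join?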

As additional information, I'm using external tables in PARQUET format stored in S3. I also noticed that by putting a condition on one of the tables of the "unwanted" join, the plan was different and the join disappeared.

EDIT: Here's a piece of the explain plan. If I understand correctly, the second line shows a join using only the distribution key, whereas nowhere in the query is there a join using only this key:

->  XN Hash Join DS_BCAST_INNER  (cost=740000221.79..1073625281.79 rows=50000000 width=70)
  Hash Cond: ("outer".distfield = "inner".distfield)
  Remarks: Derives subplan 3
  ->  XN Partition Loop  (cost=0.00..251000060.00 rows=50000000 width=44)
        ->  XN Seq Scan PartitionInfo of schemaname.table1  (cost=0.00..60.00 rows=1 width=8)
              Filter: (((((filterfieldview)::text = 'v1'::text) AND (distfield = 26)) AND (subplan 3: ($7 = distfield)) AND (subplan 4: ($9 = distfield)) AND (subplan 5: ($13 = distfield)) AND (subplan 6: ($14 = distfield)) AND (subplan 7: ($16 = distfield)) AND (subplan 8: ($19 = distfield)) AND (subplan 9: (distfield = $22)) AND (subplan 10: (distfield = $25)) AND (subplan 11: ($28 = distfield)) AND (subplan 12: ($31 = distfield)) AND (subplan 13: ($34 = distfield)) AND (subplan 14: ($36 = distfield)) AND (subplan 15: (distfield = $40)) AND (subplan 16: ($46 = distfield)) AND (subplan 17: (distfield = $50)) AND (subplan 18: ($52 = distfield)))
        ->  XN S3 Query Scan table1  (cost=0.00..125500000.00 rows=50000000 width=36)
              ->  S3 Seq Scan schemaname.table1 location:"s3://s3location/TABLE1" format:PARQUET  (cost=0.00..125000000.00 rows=50000000 width=36)
                    Filter: ((condition1)::text = 'A'::text)
  ->  XN Hash  (cost=740000221.29..740000221.29 rows=200 width=26)
        ->  XN Subquery Scan volt_dt_11  (cost=740000219.29..740000221.29 rows=200 width=26)
              ->  XN HashAggregate  (cost=740000219.29..740000219.29 rows=200 width=26)
                    ->  XN Hash Join DS_DIST_BOTH  (cost=650000097.51..740000217.56 rows=347 width=26)
                          Outer Dist Key: "inner".derived_col1
                          Inner Dist Key: ref_darwin_lookup.derived_col1
                          Hash Cond: (("outer".derived_col1 = "inner".derived_col1) AND ("outer".distfield = "inner".distfield))
                          ->  XN Partition Loop  (cost=300000006.25..300000101.25 rows=1000 width=26)
                                ->  XN Seq Scan PartitionInfo of schemaname.table2  (cost=0.00..75.00 rows=1 width=8)
                                      Filter: ((((filterfieldview)::text = 'v1'::text) AND (distfield = 26)))
                                ->  XN S3 Query Scan table2  (cost=150000003.13..150000013.13 rows=1000 width=18)
                                      ->  S3 HashAggregate  (cost=150000003.13..150000003.13 rows=1000 width=18)
                                            ->  S3 Seq Scan schemaname.table2 location:"s3://s3location/TABLE2" format:PARQUET  (cost=0.00..150000000.00 rows=1251 width=18)
                                                  Filter: (((final_value)::text = 'T'::text) AND ((condition1)::text = 'A'::text) AND (condition2 IS NULL))
                          ->  XN Hash  (cost=350000086.26..350000086.26 rows=1000 width=26)
                                ->  XN Partition Loop  (cost=350000006.26..350000086.26 rows=1000 width=26)
                                      ->  XN Seq Scan PartitionInfo of schemaname.ref_darwin_lookup  (cost=0.00..60.00 rows=1 width=8)
                                            Filter: ((((filterfieldview)::text = 'v1'::text) AND (distfield = 26)))
                                      ->  XN S3 Query Scan ref_darwin_lookup  (cost=175000003.13..175000013.13 rows=1000 width=18)
                                            ->  S3 HashAggregate  (cost=175000003.13..175000003.13 rows=1000 width=18)
                                                  ->  S3 Seq Scan schemaname.ref_darwin_lookup location:"s3://s3location/LOOKUP/" format:PARQUET  (cost=0.00..175000000.00 rows=1251 width=18)
                                                        Filter: (((condition1)::text = 'TYPE2'::text) AND ((condition2)::text = 'Numeric'::text) AND ((condition3)::text = 'A'::text))
->  XN Hash  (cost=300000095.02..300000095.02 rows=1251 width=44)
  ->  XN Partition Loop  (cost=0.00..300000095.02 rows=1251 width=44)
        ->  XN Seq Scan PartitionInfo of schemaname.table3 volt_dt_8  (cost=0.00..70.00 rows=1 width=8)
              Filter: ((((filterfieldview)::text = 'v1'::text) AND (distfield = 31)))
        ->  XN S3 Query Scan volt_dt_8  (cost=0.00..150000012.51 rows=1251 width=36)
              ->  S3 Seq Scan schemaname.table3 volt_dt_8 location:"s3://s3location/TABLE3" format:PARQUET  (cost=0.00..150000000.00 rows=1251 width=36)
                    Filter: (((condition3)::text = 'T'::text) AND ((condition1)::text = 'A'::text) AND (condition2 IS NULL))
                                                   

When I monitor the execution of the query, I see millions of rows processed at the step treating TABLE1, although this table has only thousands of records...
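The `rows=50000000` estimates on every S3 Seq Scan also make me suspect the planner has no statistics for the external tables and is falling back to a default size. If that's the cause, would setting the `numRows` table property on the external tables give the optimizer realistic cardinalities? A sketch (the row counts here are placeholders for the approximate real sizes):

```sql
-- Tell Redshift Spectrum how big the external tables really are;
-- without this, the planner assumes a large default row count.
ALTER TABLE schemaname.table1
  SET TABLE PROPERTIES ('numRows' = '1000000');

ALTER TABLE schemaname.table2
  SET TABLE PROPERTIES ('numRows' = '15000');
```

Is missing external-table statistics a plausible explanation for the optimizer choosing this join?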
