Delta Lake - Data skipping with z order and bloom filter index

1.2k Views Asked by gamezone25 At 07 December 2022 at 18:12

I am trying to optimize the transformations in my ETL pipeline in Databricks using data skipping for Delta Lake. I tried Z-ordering and a bloom filter index, but I am unable to see what the impact is. Where can I see whether they actually caused Parquet files to be skipped during reading? The screenshot below is from reading a Delta table with a bloom filter index on one column, and Z-order on one column as well. The screenshot shows the number of files pruned; is that the number of Parquet files skipped when reading?

So my question is: which generally gives the best performance improvement, Z-ordering, bloom filter indexing, or a combination of both? And how can I check which combination of columns (Z-ordered or bloom filter indexed) gives the best improvement?


There is 1 best solution below

Denny Lee On 09 December 2022 at 05:03

Z-Order and Bloom Filter Indexes can be run independently of each other. In general:

Z-Order works best with around 3-5 columns; prioritize common filter columns first, then join keys.
Bloom filters allow faster point (needle-in-a-haystack) queries, so they are handy for string columns such as names and/or hashes.
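
As a rough illustration of the two points above, here is a minimal Python sketch that builds the Databricks SQL statements for each technique. The table name `events` and the columns `event_date`, `customer_id`, and `user_hash` are hypothetical, and the `fpp` and `numItems` values are placeholders to tune for your data, not recommendations. The bloom filter statement is listed first on the assumption that the index only covers data written after it is created, so the rewrite done by OPTIMIZE can pick it up.

```python
from typing import Sequence


def zorder_sql(table: str, columns: Sequence[str]) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not columns:
        raise ValueError("Z-Order needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def bloom_filter_sql(table: str, column: str,
                     fpp: float = 0.1, num_items: int = 1_000_000) -> str:
    """Build a CREATE BLOOMFILTER INDEX statement for one column.

    fpp is the target false-positive probability and num_items the expected
    number of distinct values; both defaults here are placeholders.
    """
    return (
        f"CREATE BLOOMFILTER INDEX ON TABLE {table} "
        f"FOR COLUMNS({column} OPTIONS (fpp={fpp}, numItems={num_items}))"
    )


TABLE = "events"  # hypothetical table name

statements = [
    # Point lookups on a high-cardinality string column (hypothetical).
    bloom_filter_sql(TABLE, "user_hash"),
    # Common filter column first, then a join key (both hypothetical).
    zorder_sql(TABLE, ["event_date", "customer_id"]),
]

for stmt in statements:
    print(stmt)
    # In a Databricks notebook you would run each one with: spark.sql(stmt)
```

Because the two statements are independent, you can apply one, rerun your query, and compare the files-pruned metric before adding the other.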

Please start with this and, if you'd like to dive deeper, check out the Tech Talk "Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks" (shameless plug, as I'm one of the speakers).
