Delta Lake - Data skipping with z order and bloom filter index


I am trying to optimize the transformations in my ETL pipeline in Databricks using data skipping for Delta Lake. I tried Z-ordering and a Bloom filter index, but I cannot tell what impact they have. Where can I see whether they actually caused Parquet files to be skipped during reads? The screenshot below is from reading a Delta table with a Bloom filter index on one column, and Z-ordering on one column as well. It shows a "number of files pruned" metric; is that the number of Parquet files skipped when reading?

So my question is: in general, what gives the best performance improvement: Z-ordering, Bloom filter indexing, or a combination of both? And how can I check which combination of columns (Z-ordered or Bloom filter indexed) gives the best performance improvement?

[Screenshot: read of the Delta table, showing the number of files pruned]

Answer by Denny Lee

Z-Order and Bloom Filter Indexes can be run independently of each other. In general:

  • Z-Order is best with around 3-5 columns where you prioritize common filter columns and then join keys.
  • Bloom Filters allow for faster point (needle in the haystack) queries, so they are handy for string columns like names and/or hashes.
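
As a rough illustration of the two options above, here is a minimal Python sketch that builds the Databricks SQL statements for a Bloom filter index and a Z-Order. The table name `events`, the column names, and the `fpp` / `numItems` values are hypothetical placeholders, not taken from the question; adjust them to your own table.

```python
def zorder_sql(table, columns):
    """Build a Databricks OPTIMIZE ... ZORDER BY statement."""
    if not columns:
        raise ValueError("need at least one Z-Order column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def bloom_filter_sql(table, column, fpp=0.1, num_items=1_000_000):
    """Build a Databricks CREATE BLOOMFILTER INDEX statement for one column."""
    return (
        f"CREATE BLOOMFILTER INDEX ON TABLE {table} "
        f"FOR COLUMNS ({column} OPTIONS (fpp={fpp}, numItems={num_items}))"
    )


# Hypothetical table and columns, for illustration only.
statements = [
    bloom_filter_sql("events", "user_hash"),
    zorder_sql("events", ["event_date", "customer_id"]),
]

for stmt in statements:
    print(stmt)
    # In a Databricks notebook, where `spark` is predefined: spark.sql(stmt)
```

The Bloom filter statement is listed first because, as far as I know, the index only applies to files written after it is created, so a later OPTIMIZE that rewrites files gives it data to cover.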

Please start with this, and if you'd like to dive deeper, check out Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks (shameless plug here, as I'm one of the speakers).
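
For intuition about what a "files pruned" count measures, here is a toy Python model of min/max data skipping. It is a sketch, not Delta Lake's implementation: the "files" are plain lists, and sorting on a single column stands in for the clustering that Z-Order performs across several columns.

```python
import random


def file_stats(values, rows_per_file):
    """Split values into 'files' and record each file's (min, max)."""
    files = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(f), max(f)) for f in files]


def files_pruned(stats, needle):
    """Count files whose [min, max] range cannot contain the needle."""
    return sum(1 for lo, hi in stats if needle < lo or needle > hi)


random.seed(0)
values = list(range(10_000))
random.shuffle(values)

unsorted_stats = file_stats(values, 1_000)          # rows land in files at random
sorted_stats = file_stats(sorted(values), 1_000)    # rows clustered by value

needle = 4_242
print("pruned, random layout:", files_pruned(unsorted_stats, needle), "of", len(unsorted_stats))
print("pruned, sorted layout:", files_pruned(sorted_stats, needle), "of", len(sorted_stats))
```

With the clustered layout, 9 of the 10 files can be ruled out from their statistics alone, while the random layout gives almost every file a range wide enough to include the needle. Comparing such a pruned count before and after changing the layout is one way to judge whether a given column choice helps.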