Does ClickHouse support the implementation of Bloom Filters?

39 Views Asked by At

ClickHouse offers Bloom filter indexes:explore their functionality and implementation methods to see how these probabilistic data structures can accelerate queries and lessen computational burden

What is the Bloom filter index in ClickHouse

Implementation methods of leveraging Bloom Filters to optimize query performance and reduce computational load.

1

There are 1 best solutions below

0
Rich Raposa On

Let's start at the top first: every MergeTree table has a primary index, which is built from the primary key. The primary index consists of the values of the primary of the first row of each granule, which is every 8,192 rows. The table data is sorted on disk by the primary key (or the sort order if the two are different, but the primary key must be a subset of the sort order).

If you want, you can define a secondary index on a MergeTree table, which is also referred to as a skipping index. What do they skip? They skip your existing granules. How? Well, it depends on how you define the skipping index. The are different types, like set, minmax and bloom_filter.

For example, a minmax skipping index stores the minimum and maximum value of a specific column for each existing granule. More accurately, for the GRANULARITY of the skipping index - which is a multiple of the table's granule size. For example, a skipping index with GRANULARITY 10 will contain the values from 10 consecutive granules, or 81,920 rows. So that is how many rows you could potentially skip using this index. Let's call this a block of rows.

The bloom_filter skipping index stores the values from each block in a Bloom filter data structure (similar to what set does except set attempts to store every value of the column from each block, while a Bloom filter is a bitmap of 1's and 0's - but I assume you know that already).

So that is where Bloom filters exist in ClickHouse. All the usual Bloom filter rules apply here: you can get false positives -but there is no harm there except for the unnecessary processing of rows. Your WHERE clause will filter out any false positives.

I hope that helps...