Is there an optimized version of counting bloom filters for the case when counts are very large?


I became curious about this after seeing this paper on bloom clocks. The basic idea is that a counting bloom filter, which supports both the sum (counterwise sum) and union (counterwise max) of multisets, can be used as a compromise between scalar Lamport clocks and vector clocks, where each node increments a hash-determined subset of the counters by 1 on an event.

Taking an example with 16 counters, where each node increments 3 of them per event, you still get a clock that is fairly good at finding runs of a thousand causally unrelated events in the high-concurrency limit, where many events can have similar timestamps.
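The mechanics described above can be sketched as follows (a minimal illustration, not from the paper; deriving the counter indices from a SHA-256 of a node/sequence pair is an assumption, and any hash family would do):

```python
import hashlib

M = 16  # total counters (from the example above)
K = 3   # counters incremented per event (from the example above)

def indices(node_id: str, seq: int) -> set:
    """Pick up to K counter indices from a hash of the event identity.
    Hash collisions can select fewer than K distinct counters."""
    h = hashlib.sha256(f"{node_id}:{seq}".encode()).digest()
    return {h[i] % M for i in range(K)}

def tick(clock: list, node_id: str, seq: int) -> list:
    """Increment the hash-determined subset of counters by 1."""
    for i in indices(node_id, seq):
        clock[i] += 1
    return clock

def multiset_sum(a: list, b: list) -> list:
    """Sum of multisets: counterwise sum."""
    return [x + y for x, y in zip(a, b)]

def union(a: list, b: list) -> list:
    """Union of multisets: counterwise max."""
    return [max(x, y) for x, y in zip(a, b)]

def maybe_happened_before(a: list, b: list) -> bool:
    """a <= b counterwise is necessary (not sufficient) for a -> b."""
    return all(x <= y for x, y in zip(a, b))
```

As with any bloom-filter scheme, `maybe_happened_before` can give false positives but never false negatives, which is what makes it a probabilistic compromise between a scalar clock and a full vector clock.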

However, for a practical implementation of things like causally ordered message topics where you can end up with more than 4 billion events in a strict sequence, you still need the counting bloom filter to use 64-bit counters.

This made me wonder: are there specialized data structures that minimize space usage for the very specific case of a probabilistic multiset with very large counts, which needs to support multiset sum and union?


1 Answer

Answer by Matt Timmermans:

You can use floating-point counters.

For example, let's say you store a counter as a pair of 8-bit numbers (e, m), representing the count m << e.

When incrementing a counter:

  • If e > 0, then the counter is missing e low-order bits. To compensate for that, you randomly decide whether or not to increment m, with probability 2^-e.
  • If incrementing m would overflow its 8 bits, then increment e and shift m one bit to the right.

Of course, this makes the counters a little inaccurate, but each probabilistic increment adds 1 to the represented value in expectation (2^e with probability 2^-e), so the counter is unbiased and the expected error is small compared to the count.