What is a count-min sketch? When would you use it?

Question

What is a count-min sketch? When would you use it?

2.3k Views Asked by templatetypedef At 17 October 2020 at 00:14

What is a count-min sketch? In what situations would it be useful?

There are 1 best solutions below

**templatetypedef** · Accepted Answer · 2020-10-17T00:14:59.730000

The Mile-High Overview

Intuitively, you can think of a count-min sketch as a space-efficient data structure for approximating how many times you've seen a given item in a data stream. From a client-side perspective, the count-min sketch supports two operations:

increment(x), which says "I've seen x another time," and
estimate(x), which says "please give me an estimate of how many times I've seen x."

If you just look at this interface, you might be thinking "isn't this something I could do with a hash table or a BST?" And you'd be right - you could just make a BST whose keys are the elements and whose values are the number of times each item is seen, or do the same with a hash table. If you have enough memory to keep track of all the keys you're encountering, this is not an unreasonable. But imagine that you're working on, say, an Amazon server and you're tracking views of each product. Suddenly, even writing down the names of all the pages that are visited is going to take a ton of memory. Or imagine you're Google and you want to find the most-commonly-visited pages each day. It would be prohibitively expensive to get enough RAM simply to write down all the search queries, and paging out to disk would be too slow.

What makes the count-min sketch shine is that the amount of memory needed by a count-min sketch can be tweaked by the user. If you give it more memory, it will produce better estimates of the true frequencies of the elements it's seen. If you give it less memory, it will produce lower-quality estimates, but with quantifiable guarantees about how likely it is for those estimates to be close to the true value. This means that if, say, you only have 1MB to dedicate to this purpose, you could get rough estimates of the frequencies, whereas if you have 128MB you could get dramatically better estimates.

The estimates given back by a count-min sketch will never underestimate the true frequencies of the elements. For example, if a count-min sketch says that an item has appeared 50 times, then the element might have appeared 50 times, or 49 times, or 48 times, etc., but it't not possible for it to have appeared 100 times. This makes the count-min sketch useful for finding high-frequency items: low-frequency items might have their frequencies overestimated a little bit, but high-frequency items will always appear popular.

The Internal Details

The count-min sketch is a fairly straightforward data structure to implement. The basic idea is the following. Imagine we have an array of counters, and we want to use that array to keep track of the frequencies of the items we're seeing. We'll use a hash function to assign each item to some counter (compute its hash code and mod it by the table size). Whenever we increment that item, we'll just go to the appropriate counter, then increment it. To provide an estimate, we'll just go to that counter and return the value stored there.

One way of thinking about how the count-min sketch works is to imagine each counter as a "bucket" holding all items of some type. We then estimate how frequent something is by seeing how many items are in its bucket, regardless of what those items are, as shown here:

A collection of items distributed into buckets

As you can see, we get reasonably good estimates for frequent items, while infrequent items might have their frequencies grossly overestimated. This also gives a nice intuition as to why the count-min sketch never underestimates the frequencies of the elements. You'll always at least count the item itself when looking in its bucket.

If we make some reasonable assumptions about the hash functions we're using, we can formally prove that, on average, the estimate given back for an item's frequency is at most its actual frequency, plus the total number of items divided by the total number of counters. And that makes intuitive sense. If you have lots and lots of counters, at some point each item gets its own counter and the estimates will be perfectly accurate. On the other extreme, if you have almost no counters, then you'd expect all the items to be crammed into a small number of buckets, and the totals will start to be way off.

While this approach guarantees that on expectation the estimates returned will be good, that doesn't mean that you have a high probability of getting a good estimate. One way to think about this is to think about what it means to have a good estimate "on expectation." Intuitively, that would mean that if you were to build a bunch of these data structures, then the average of the estimates would probably be pretty good. However, any one individual estimate is not necessarily going to be all that good.

The count-min sketch takes this idea to heart and, instead of just having one array of counters, maintains several independent arrays of counters, each of which has a different hash function dropping items into buckets. This gives a degree of redundancy and means that it's highly unlikely that you'll get "unlucky" with items colliding in bad ways.

To get its overall estimate, the count-min sketch does something a bit more clever than just averaging the estimates together. Remember that the count-min sketch can never underestimate the actual frequencies of the items being stored, and can only overestimate them. This means that if you have a collection of different estimates coming back, the larger an estimate is, the worse it is. Why? Because the larger the estimate, the more items other than the one we care about it's counting. (Think back to buckets - if a bucket has a bunch of other items in it besides the one we care about, we don't want to use the size of that bucket as an estimate).

This is where the "min" part of "count-min sketch" comes in. When returning an estimate, the count-min sketch returns the minimum estimate among all the estimates generated. This is the estimate with the least noise in it. Intuitively, this is very likely to give you a good estimate - the only way it fails is if every single estimate was bad, which is fairly unlikely.

This means that the overall data structure, and the logic to manipulate it, is fairly straightforward:

Image of a 2D grid representing multiple rows of counters, plus pseudocode summarizing how the structure works.

Learning More

There's more to explore about the count-min sketch. For example, how do you formally analyze a count-min sketch to determine how many counters you need per row or how many independent structures you'll need? What kinds of hash functions can you use? To learn more about that, check out these lecture slides, which go into some detail on the topic.

What happens if you want to support both increments and decrements? How do you use the count-min sketch to find the most frequent elements in a data stream? The original paper on the topic is a good resource here.

Can you get the same results as a count-min sketch without randomness? Yes, using some clever number theory.

What is a count-min sketch? When would you use it?

There are 1 best solutions below

The Mile-High Overview

The Internal Details

Learning More

Related Questions in DATA-STRUCTURES

Related Questions in STREAM

Related Questions in COUNT-MIN-SKETCH

Trending Questions

Popular # Hahtags

Popular Questions