What is a count-min sketch? In what situations would it be useful?
What is a count-min sketch? When would you use it?
2.3k Views Asked by templatetypedef At
1
There are 1 best solutions below
Related Questions in DATA-STRUCTURES
- Why is the runtime for this O(n)?
- Purpose of last 2 while loops in the merge algorithm of merge sort sorting technique
- What is the problem in my "sumAtBis" code?
- Asking code suggestions about data structure and algorithm
- What would be the most efficient way to store multiple sets of fixed arrays (std::vector)?
- About Suffix Trees features
- Getting wrong answer in Binary Search solution
- Are there techniques to mathematically compute the amount of searching in greedy graph searching?
- AVL tree Nth largest operation - How to have all my tests pass? JAVA
- Why does the map size change?
- Complexity in Union of disjointed sets with lists
- Hash collisions in Golang map resolving
- C++ ordered map optimized with index access
- How to sort this list of strings along with the strings and output the result as expected?
- Why deleting an element in a linkedlist costs O(1)
Related Questions in STREAM
- How to start a download and render a response without hitting disk?
- How to properly handle byte buffers from C to Ada?
- Color Thresholding JS, Average Image Color Detect JS
- FastAPI finish streaming function in StreamingResponse even if client closed the connection
- How can I connect to a websocket from a vue app that is exposed to the network (yarn dev --host)?
- PHP: How to get the Content-Length from stream request for chunk download
- How to handle errors inside a NodeJS stream?
- Python TCP Server that both sends and or receives data (independently) using asyncio streams?
- Efficient string replace in a stream
- Using polly to generate audio from LLM output
- Stream YAML output, rather than loading everything into memory
- Python: Creating Zip file from Minio objects results in duplicate entries for each file
- Node.js/Express File Download Returns 0-Byte Plaintext Files
- Stream data from server component in NextJS 14 App Router
- How to read from last position when logstream is interrputed
Related Questions in COUNT-MIN-SKETCH
- Why are bloom filters not implemented like count-min sketch?
- What is a count-min sketch? When would you use it?
- Does the count-min sketch take less space than a typical sparse vector format?
- How to get top K elements from count-min-sketch?
- Use which hash functions for count-min sketch?
- Count-Min Sketch and Heavy-Hitters problem
- store top k results from count-min-sketch
- Retrieve the average count in count-min-sketch datastructure
- Count Min Sketch: How to handle counters overflow?
- How does count min sketch find the most frequent item in a stream? - Heavy Hitters
- What is max element can be add to a count min sketch, and how to use it
- How can i determine the width and depth of a count-min sketch?
- Which Hash functions can be used in count-min sketch?
- Non-trivial usage of count-min sketch data-structure
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
The Mile-High Overview
Intuitively, you can think of a count-min sketch as a space-efficient data structure for approximating how many times you've seen a given item in a data stream. From a client-side perspective, the count-min sketch supports two operations:
If you just look at this interface, you might be thinking "isn't this something I could do with a hash table or a BST?" And you'd be right - you could just make a BST whose keys are the elements and whose values are the number of times each item is seen, or do the same with a hash table. If you have enough memory to keep track of all the keys you're encountering, this is not an unreasonable. But imagine that you're working on, say, an Amazon server and you're tracking views of each product. Suddenly, even writing down the names of all the pages that are visited is going to take a ton of memory. Or imagine you're Google and you want to find the most-commonly-visited pages each day. It would be prohibitively expensive to get enough RAM simply to write down all the search queries, and paging out to disk would be too slow.
What makes the count-min sketch shine is that the amount of memory needed by a count-min sketch can be tweaked by the user. If you give it more memory, it will produce better estimates of the true frequencies of the elements it's seen. If you give it less memory, it will produce lower-quality estimates, but with quantifiable guarantees about how likely it is for those estimates to be close to the true value. This means that if, say, you only have 1MB to dedicate to this purpose, you could get rough estimates of the frequencies, whereas if you have 128MB you could get dramatically better estimates.
The estimates given back by a count-min sketch will never underestimate the true frequencies of the elements. For example, if a count-min sketch says that an item has appeared 50 times, then the element might have appeared 50 times, or 49 times, or 48 times, etc., but it't not possible for it to have appeared 100 times. This makes the count-min sketch useful for finding high-frequency items: low-frequency items might have their frequencies overestimated a little bit, but high-frequency items will always appear popular.
The Internal Details
The count-min sketch is a fairly straightforward data structure to implement. The basic idea is the following. Imagine we have an array of counters, and we want to use that array to keep track of the frequencies of the items we're seeing. We'll use a hash function to assign each item to some counter (compute its hash code and mod it by the table size). Whenever we increment that item, we'll just go to the appropriate counter, then increment it. To provide an estimate, we'll just go to that counter and return the value stored there.
One way of thinking about how the count-min sketch works is to imagine each counter as a "bucket" holding all items of some type. We then estimate how frequent something is by seeing how many items are in its bucket, regardless of what those items are, as shown here:
As you can see, we get reasonably good estimates for frequent items, while infrequent items might have their frequencies grossly overestimated. This also gives a nice intuition as to why the count-min sketch never underestimates the frequencies of the elements. You'll always at least count the item itself when looking in its bucket.
If we make some reasonable assumptions about the hash functions we're using, we can formally prove that, on average, the estimate given back for an item's frequency is at most its actual frequency, plus the total number of items divided by the total number of counters. And that makes intuitive sense. If you have lots and lots of counters, at some point each item gets its own counter and the estimates will be perfectly accurate. On the other extreme, if you have almost no counters, then you'd expect all the items to be crammed into a small number of buckets, and the totals will start to be way off.
While this approach guarantees that on expectation the estimates returned will be good, that doesn't mean that you have a high probability of getting a good estimate. One way to think about this is to think about what it means to have a good estimate "on expectation." Intuitively, that would mean that if you were to build a bunch of these data structures, then the average of the estimates would probably be pretty good. However, any one individual estimate is not necessarily going to be all that good.
The count-min sketch takes this idea to heart and, instead of just having one array of counters, maintains several independent arrays of counters, each of which has a different hash function dropping items into buckets. This gives a degree of redundancy and means that it's highly unlikely that you'll get "unlucky" with items colliding in bad ways.
To get its overall estimate, the count-min sketch does something a bit more clever than just averaging the estimates together. Remember that the count-min sketch can never underestimate the actual frequencies of the items being stored, and can only overestimate them. This means that if you have a collection of different estimates coming back, the larger an estimate is, the worse it is. Why? Because the larger the estimate, the more items other than the one we care about it's counting. (Think back to buckets - if a bucket has a bunch of other items in it besides the one we care about, we don't want to use the size of that bucket as an estimate).
This is where the "min" part of "count-min sketch" comes in. When returning an estimate, the count-min sketch returns the minimum estimate among all the estimates generated. This is the estimate with the least noise in it. Intuitively, this is very likely to give you a good estimate - the only way it fails is if every single estimate was bad, which is fairly unlikely.
This means that the overall data structure, and the logic to manipulate it, is fairly straightforward:
Learning More
There's more to explore about the count-min sketch. For example, how do you formally analyze a count-min sketch to determine how many counters you need per row or how many independent structures you'll need? What kinds of hash functions can you use? To learn more about that, check out these lecture slides, which go into some detail on the topic.
What happens if you want to support both increments and decrements? How do you use the count-min sketch to find the most frequent elements in a data stream? The original paper on the topic is a good resource here.
Can you get the same results as a count-min sketch without randomness? Yes, using some clever number theory.