I understand the Rabin-Karp algorithm and its use in string searching. What I don't quite understand is how it can dynamically slice a file into variable-length chunks. It's said to calculate the hash of a small window of data bytes (e.g. 48 bytes) at every single byte offset, and the chunk boundaries—called breakpoints—are placed wherever the last N bits (e.g. 13) of the hash are zero. That gives an average chunk size of 2^N = 2^13 = 8192 bytes = 8 KB. Questions (a rough sketch of the loop I have in mind follows below):
- Does the Rabin-Karp rolling hash start with the first 48 bytes and then roll forward one byte at a time?
- If so, isn't that too much to compute for a large file, even with a simple hash function?
- Given unpredictable data, how can we rely on the last N bits of the hash becoming zero before the maximum chunk size is reached?
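Here is a minimal sketch (Python) of the loop I have in mind, in case it clarifies what I'm asking about. The window size, base, modulus, mask bits, and the min/max chunk limits are just example values I picked, not the parameters of any real implementation, and a real Rabin fingerprint uses polynomial arithmetic over GF(2) rather than this simple multiply-and-add hash:

```python
# Sketch of content-defined chunking with a Rabin-Karp style rolling hash.
# All constants below are illustrative, not taken from any particular tool.

WINDOW = 48                 # bytes hashed at each offset
BASE = 257                  # polynomial base
MOD = (1 << 61) - 1         # large prime modulus
MASK_BITS = 13              # breakpoint when the low 13 bits are zero -> ~8 KB average
MASK = (1 << MASK_BITS) - 1
MIN_CHUNK = 2 * 1024        # ignore breakpoints that would make tiny chunks
MAX_CHUNK = 64 * 1024       # force a breakpoint if no hash match occurs

# Precomputed BASE^(WINDOW-1) mod MOD, used to remove the outgoing byte.
POW = pow(BASE, WINDOW - 1, MOD)

def chunk_boundaries(data: bytes):
    """Yield end offsets (exclusive) of each chunk in `data`."""
    start = 0                # beginning of the current chunk
    h = 0                    # rolling hash of the current window
    for i, b in enumerate(data):
        # Slide the window: drop the byte leaving it, add the byte entering it.
        # Each step is O(1); the full 48-byte hash is never recomputed.
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + b) % MOD

        length = i + 1 - start
        at_breakpoint = i + 1 >= WINDOW and (h & MASK) == 0
        if (at_breakpoint and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            yield i + 1
            start = i + 1
    if start < len(data):
        yield len(data)      # final partial chunk
```

In this sketch the window just keeps rolling across breakpoints, and the MIN_CHUNK/MAX_CHUNK bounds are my guess at how implementations avoid degenerate chunk sizes; that's partly what the last question is about.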