I understand the Rabin-Karp algorithm and its use in string searching. What I don't quite understand is how it can dynamically slice a file into variable-length chunks. It's said to calculate the hash of a small window of data bytes (e.g. 48 bytes) at every single byte offset, and the chunk boundaries—called breakpoints—are placed wherever the last N bits (e.g. 13) of the hash are zero. That gives an average chunk size of 2^N = 2^13 = 8192 bytes = 8 KB. Questions (a rough sketch of the loop I have in mind follows below):
- Does the Rabin-Karp rolling hash start with the first 48 bytes and then roll forward one byte at a time?
- If so, isn't that too much to compute for a large file, even with a simple hash function?
- Given unpredictable data, how can we rely on the last N bits of the hash becoming zero before the maximum chunk size is reached?
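Here is a minimal sketch (Python) of the loop I have in mind, in case it clarifies what I'm asking about. The window size, base, modulus, mask bits, and the min/max chunk limits are just example values I picked, not the parameters of any real implementation, and a real Rabin fingerprint uses polynomial arithmetic over GF(2) rather than this simple multiply-and-add hash:

```python
# Sketch of content-defined chunking with a Rabin-Karp style rolling hash.
# All constants below are illustrative, not taken from any particular tool.

WINDOW = 48                 # bytes hashed at each offset
BASE = 257                  # polynomial base
MOD = (1 << 61) - 1         # large prime modulus
MASK_BITS = 13              # breakpoint when the low 13 bits are zero -> ~8 KB average
MASK = (1 << MASK_BITS) - 1
MIN_CHUNK = 2 * 1024        # ignore breakpoints that would make tiny chunks
MAX_CHUNK = 64 * 1024       # force a breakpoint if no hash match occurs

# Precomputed BASE^(WINDOW-1) mod MOD, used to remove the outgoing byte.
POW = pow(BASE, WINDOW - 1, MOD)

def chunk_boundaries(data: bytes):
    """Yield end offsets (exclusive) of each chunk in `data`."""
    start = 0                # beginning of the current chunk
    h = 0                    # rolling hash of the current window
    for i, b in enumerate(data):
        # Slide the window: drop the byte leaving it, add the byte entering it.
        # Each step is O(1); the full 48-byte hash is never recomputed.
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + b) % MOD

        length = i + 1 - start
        at_breakpoint = i + 1 >= WINDOW and (h & MASK) == 0
        if (at_breakpoint and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            yield i + 1
            start = i + 1
    if start < len(data):
        yield len(data)      # final partial chunk
```

In this sketch the window just keeps rolling across breakpoints, and the MIN_CHUNK/MAX_CHUNK bounds are my guess at how implementations avoid degenerate chunk sizes; that's partly what the last question is about.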