According to RFC1951:
3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
The Huffman codes for the two alphabets appear in the block
immediately after the header bits and before the actual
compressed data, first the literal/length code and then the
distance code. Each code is defined by a sequence of code
lengths, as discussed in Paragraph 3.2.2, above. For even
greater compactness, the code length sequences themselves are
compressed using a Huffman code.
I am really sorry but mostly what i understood is not from RFC1951 but from answers here. The statement above does not give a clear picture to a beginner or any visual context of what the data looks like.
I read that they are two trees from zlib's doc algorithm.txt:
Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block.
As we all can see that the later points out clearly than the RFC1951, Which brings the question.
What comes after (HCLEN+4)*3 bits?
I know that you add 4 to the decimal value of the 4 bits HCLEN then extract the exact total of that number from bit stream where each extract is 3 bit number (0 to 7) and Thanks M. Adler for clearing out how bits are read from right to left.
=====================================
bitstream data
_____________________________________
|17|57|????????
Where does the first tree end?
My goal is to identify why 2 inputs sharing a prefix aren't sharing a few bytes somewhere in the compressed stream. I know its because of the Block header but since dynamic blocks construct trees based on input data then something has to be identical in theory because it's encoding not hashing, so the headers themselves should have something similar like this:
raw block data at (19, 3)
Dynamic Huffman tree: length codes: 13, distances codes: 29, code_lengths_length: 16
lengths: [4, 0, 6, 4, 5, 4, 3, 3, 5, 6, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (19, 3)
Dynamic Huffman tree: length codes: 9, distances codes: 29, code_lengths_length: 16
lengths: [4, 0, 6, 5, 3, 3, 3, 3, 5, 6, 5, 3, 0, 0, 0, 0, 3, 5, 5]
raw block data at (20, 3)
Dynamic Huffman tree: length codes: 9, distances codes: 30, code_lengths_length: 16
lengths: [3, 0, 5, 0, 6, 3, 4, 5, 6, 6, 6, 2, 0, 0, 0, 0, 2, 5, 5]
deflate-blocktype 2 dyna huff beginning at (16, 0)
raw block data at (16, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 6, 6, 4, 4, 3, 3, 5, 5, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (17, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 6, 5, 4, 4, 3, 3, 5, 6, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (17, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 5, 5, 4, 3, 3, 3, 4, 5, 5, 3, 0, 0, 0, 0, 3, 5, 5]
4C 9D D5 CE 35 3D 19 40 CF 49
6C 9C D7 B2 F3 4A 15 84 EF A9
4C 9C D7 B2 F3 BA 0D 85 EF 33
94 9D D7 B2 EB 4A 11 86 EF A9
94 9D D7 92 EB BA 11 45 DF 5D
What RFC 1951 says comes after it:
It's not a tree, and I think you mean the second Huffman code description. The first Huffman code description was the
(HCLEN+4)*3bits.The second code description ends when
HLIT + 257lengths from the Huffman codes using the first code description have been read.RFC 1951 goes on to say:
So the third code description ends when
HDIST + 1lengths from the Huffman codes using the first code description have been read.There may be fewer Huffman codes than lengths, since some of the codes represent repeated zero lengths or repeats of the last length.
Also, as noted in RFC 1951:
So it is really a sequence of Huffman codes from the first code description, plus associated extra bits as defined for repeat codes, that describe
HLIT + HDIST + 258lengths. Once you have that many lengths, the header has ended and the first length/literal code of the data follows.