I am trying to implement Huffman coding in C++ for text file compression. I am able to build the Huffman tree from the frequencies of each character in the file. When I traverse the tree to get the Huffman codes for the different characters, I store the codes as strings, so the output string ends up bigger than the input string.
unordered_map<char, string> encoding;
void store_huffman_codes(Node* root, string s){
    if(root == NULL) return;
    if(root->val != '$') encoding[root->val] = s;
    store_huffman_codes(root->left, s + '0');
    store_huffman_codes(root->right, s + '1');
}
unordered_map<char, int> m;
for(char c : test) m[c]++;
priority_queue<pair<int, Node*>, vector<pair<int, Node*>>, greater<pair<int, Node*>>> pq;
for(auto x : m){
    Node* temp = new Node(x.first);
    pq.push({x.second, temp});
}
while(pq.size() > 1){
    pair<int, Node*> a = pq.top(); pq.pop();
    pair<int, Node*> b = pq.top(); pq.pop();
    Node* temp = new Node('$');
    int val = a.first + b.first;
    temp->left = a.second; temp->right = b.second;
    pq.push({val, temp});
}
Node* root = pq.top().second;
store_huffman_codes(root, "");
string output = "";
for(char c : test){
    output += encoding[c];
}
How can I store the codes in binary rather than as a string?
The point here is that you use entire bytes for storing just one single bit.
Instead you should pack multiple bits into one single byte. This raises a question, though: how do you handle the unused bits you cannot fill for lack of data (i.e. when the data length is not a multiple of the byte size in bits)?
You could do something similar to what UTF-8 does for encoding multi-byte sequences: the number of leading one bits in a byte indicates the number of unused bits. Advantage: all information required for encoding is stored in one single byte. Disadvantage: you can only use 7 bits for data in all bytes preceding the last one, which probably outweighs the advantage.
Alternatively, you can store the number of used or unused bits in a separate byte. My recommendation: store the number of unused bits in the very first byte of the output and skip the unused bits right at the beginning of the data (i.e. the least significant bits of the second output byte).
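The layout recommended above can be sketched roughly as follows. This is only a minimal sketch under assumptions, not a definitive implementation: pack_bits is a hypothetical helper name, it takes the '0'/'1' code strings from the question's encoding map as input, and filling each byte starting from its least significant bit is an assumption chosen to match skipping the least significant bits of the first data byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: packs the '0'/'1' code strings into real bits.
// Output layout (assumed): out[0] = number of unused bits,
// out[1..] = data bytes, filled from the least significant bit upwards.
std::vector<std::uint8_t> pack_bits(
    const std::string& text,
    const std::unordered_map<char, std::string>& encoding)
{
    // First pass: count the bits so the padding is known up front.
    std::size_t total_bits = 0;
    for (char c : text) total_bits += encoding.at(c).size();

    // Bits that cannot be filled; they are skipped at the start of out[1].
    unsigned unused = static_cast<unsigned>((8 - total_bits % 8) % 8);

    std::vector<std::uint8_t> out;
    out.reserve(1 + (total_bits + 7) / 8);
    out.push_back(static_cast<std::uint8_t>(unused));

    std::uint8_t byte = 0;
    unsigned index = unused;  // first data byte starts after the unused bits
    for (char c : text) {
        for (char bit : encoding.at(c)) {
            if (bit == '1') byte |= static_cast<std::uint8_t>(1u << index);
            if (++index == 8) {  // byte is full: flush it and start a new one
                out.push_back(byte);
                byte = 0;
                index = 0;
            }
        }
    }
    // total_bits + unused is a multiple of 8, so nothing is left to flush.
    return out;
}
```

With the codes a = 0, b = 10, c = 11, the text "abc" needs 5 bits, so the result is two bytes: the count 3 followed by one data byte whose three lowest bits are unused.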
At this point you'll notice that, in addition to the data already encoded, you need to pass the variables byte and index from one recursive call to the next as well. Doing so via parameters seems unwieldy to me; instead I recommend writing a dedicated class for the entire process.
encode would then calculate and append the first byte indicating the number of unused bits, initialise byte and index appropriately as described before, and then iterate recursively over the nodes, beginning with root, just as you did yourself, with the minimal change applied as indicated above.
With this, decoding gets just as simple: read this initial byte, initialise some index to this number and start iterating over the remaining bytes, for each one getting the bit by (byte & 1u << index++) != 0 or alternatively by uint8_t bit = byte & 1u; ++index; byte >>= 1; (though building the tree top down might not be the most efficient variant, but at least it's rather easy to implement).
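The decoding loop described above could look roughly like this. It is a sketch under stated assumptions rather than a definitive implementation: the Node struct only mirrors the fields used in the question (val, left, right), decode is a hypothetical name, leaves are recognised by having no children rather than by the '$' marker, and the input is assumed to start with the unused-bit count followed by data bytes filled from the least significant bit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed node layout, modelled on the fields the question uses.
struct Node {
    char val;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(char v) : val(v) {}
};

// Hypothetical decoder: data[0] holds the number of unused bits, which are
// the least significant bits of data[1]; all later bytes are fully used.
std::string decode(const std::vector<std::uint8_t>& data, const Node* root)
{
    std::string text;
    if (data.empty() || root == nullptr) return text;

    unsigned index = data[0];  // skip the unused bits of the first data byte
    const Node* node = root;
    for (std::size_t i = 1; i < data.size(); ++i) {
        const std::uint8_t byte = data[i];
        for (; index < 8; ++index) {
            const bool bit = (byte & (1u << index)) != 0;
            node = bit ? node->right : node->left;  // '1' = right, '0' = left
            if (node == nullptr) return text;       // malformed input: stop
            if (node->left == nullptr && node->right == nullptr) {
                text += node->val;  // reached a leaf: emit its character
                node = root;        // and restart from the top of the tree
            }
        }
        index = 0;  // every byte after the first is read from bit 0
    }
    return text;
}
```

Note that a tree consisting of a single leaf produces empty codes, so a text with only one distinct character would need separate handling (for example, storing the character count).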