How to store string generated from Huffman tree to a text file

Question

How to store string generated from Huffman tree to a text file

232 Views Asked by Falcon At 20 February 2023 at 12:38

I am trying to implement Huffman Coding in C++ for text file compression. I am able to build huffman tree from the frequencies of each character in the file. When I try to traverse the tree and get huffman codes for different characters, I am storing the huffman codes as string, so the output string is getting bigger than the input string.

unordered_map<char, string> encoding;

void store_huffman_codes(Node* root, string s){
    if(root == NULL) return;
    if(root->val != '$') encoding[root->val] = s;
    store_huffman_codes(root->left, s + '0');
    store_huffman_codes(root->right, s + '1');
}

unordered_map<char, int> m;
for(char c : test) m[c]++;
priority_queue<pair<int, Node*>, vector<pair<int, Node*>>, greater<pair<int, Node*>>> pq;
for(auto x : m){
    Node* temp = new Node(x.first);
    pq.push({x.second, temp});
}
while(pq.size() > 1){
    pair<int, Node*> a = pq.top(); pq.pop();
    pair<int, Node*> b = pq.top(); pq.pop();
    Node* temp = new Node('$');
    int val = a.first + b.first;
    temp->left = a.second; temp->right = b.second;
    pq.push({val, temp});
}
Node* root = pq.top().second;
store_huffman_codes(root, "");

string output = "";
for(char c : test){
    output += encoding[c];
}

How to store the codes in binary rather than string?

Original Q&A

There are 1 best solutions below

**Aconcagua** · Accepted Answer · 2023-02-20T13:35:30.170000

The point here is that you use entire bytes for storing just one single bit.

Instead you should compress multiple bits into one single byte; there arises a question with, though: How to handle the unused bits you cannot fill for not having sufficient data (i.e. data length not being a multiple of byte size in bits)?

You could do something similar to utf-8 for encoding multi-byte sequences: The number of leading one bits in a byte indicates the number of unused bits. Advantage: All information required for encoding is stored in one single byte. Disadvantage: You only can use 7 bits to encode in all bytes preceding the last one – which probably over-weighs the advantage.

Alternatively you store the number of used or unused bits in a separate byte; my recommendation: Number of unused bits in the very first data byte and skipping the unused bytes right at the beginning (i.e. least significant bits in second byte of the output), which might then look as follows:

uint8_t byte = (8 - numberOfNodes % 8) % 8;
// assuming you tracked...
// second modulo maps 8 to 0, if that occurs

// output byte to file (number of unused bits)!

unsigned index = byte;
byte = 0;

auto append = [&encoded, &byte, &index](uint8_t bit)
{
    byte |= bit << index;
    if(++index == 8)
    {
        encoded.push_back(byte);
        index = 0;
        byte = 0;
    }
}

// replace s + 'X' by append(X)

At this point you'll notice that, additionally to the already encoded data, you need to forward byte and index from one recursive call to the next as well; doing so by parameters appears unhandy to me, though, instead I recommend writing a dedicated class for the entire process:

class Encoder
{
public:
    // suitable interface allowing to add new bytes

    // part of the public interface is a member function to trigger encoding,
    // e.g. one of:
    std::vector<uint8_t> const& encode();
    bool /* or some kind of error code */
    writeToFile(std::string path);

private:
    Node* root; // implementation detail, should not be accessible from outside!

    std::vector<uint8_t> encoded;
    // preferably instead of std::string – you're not storing text, but binary data!

    uint8_t byte;
    unsigned int index;

    // now instead of the lambda above I'd actually prefer a member function:
    void append(uint8_t bit)
    {
        // ...
    }
};

encode would now calculate and append the first byte indicating the number of unused bits and initialise byte and index appropriately as shown before, then start iterating recursively over the nodes, beginning with root, just as you did yourself, too – with the minimal change applied as indicated above.

With this, decoding gets just as simple: Read this initial byte, initialise some index to this number and start iterating the further bytes, for each one getting the bit by (byte & 1u << index++) != 0 or alternatively by uint8_t bit = byte & 1u; ++index; byte >>= 1; (though building the tree top down might not be the most efficient variant, but at least it's rather easy to implement).

How to store string generated from Huffman tree to a text file

There are 1 best solutions below

Related Questions in C++

Related Questions in HUFFMAN-CODE

Related Questions in HUFFMAN-TREE

Trending Questions

Popular # Hahtags

Popular Questions