Dictionary-based compression in Python to compress short strings individually


I have millions of < 20 char strings, and I want to compress each of them individually.

Using zlib or lz4 on each string individually doesn't work: the output is bigger than the input:

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
import zlib
for s in inputs:
    c = zlib.compress(s)
    print(c, len(c), len(s))  # the output is larger than the input

Is there a way in Python (maybe with zlib or lz4?) to use a dictionary-based compression, with a custom dictionary size (for example 64 KB or 1 MB) that would allow compression of very short strings individually?

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
D = DictionaryCompressor(dictionary_size=1_000_000)
for s in inputs:
    D.update(s)
# now the dictionary is ready    
for s in inputs:
    print(D.compress(s))

Note: "Smaz" looks promising, but it is very much hard-coded and not adaptive: https://github.com/antirez/smaz/blob/master/smaz.c

3 Answers

Answer by Claudio

Why not compress a list of all the strings, as a way of compressing each one of them?

Consider writing your own Python code for this purpose. It should not be hard to do, and it allows any individual optimization you can imagine.

If you take a library for this purpose (e.g. pyshoco, https://pypi.org/project/pyshoco/#files, or any other), you have to 'train' (i.e. adapt) it to your specific data anyway. So I suggest putting these steps together and writing your own compressor, which as a first step builds a dictionary of substrings from all your short strings along with their occurrence counts, to see what makes sense to do next.

The simplest compressor is to build a list from a set containing all of your short strings, in order to 'compress' each string 'individually' to its index in this list. Then compress the list itself and see how big the "dictionary" (i.e. the list, or an actual dictionary of index:string pairs) becomes; a sketch follows below.

It is so simple that it is probably hard to see it as the best approach to achieve the desired goal ...
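
A minimal sketch of that index idea (the variable names are mine, not a library API; it assumes no newlines inside the strings):

import zlib

inputs = [b"hello world", b"foo bar", b"hello world", b"bar foo 1234"]

# shared "dictionary": every distinct string gets a stable index
table = sorted(set(inputs))
index = {s: i for i, s in enumerate(table)}

# each string is "compressed" to a small integer
codes = [index[s] for s in inputs]

# the table itself is compressed once, as one big blob
blob = zlib.compress(b"\n".join(table))
print(codes, len(blob))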

Answer by Mark Adler

Python's zlib interface does in fact provide zdict parameters for compressobj and decompressobj, as of version 3.3 (released ten years ago).

You can provide up to a 32K dictionary to aid in compressing short strings. I would also recommend the use of raw deflate streams to minimize the size (wbits=-15).

You can construct the 32K dictionary in many ways. A good starting point would be to simply concatenate a few thousand of your short strings. See if that permits compression of your short strings. Test with strings that are not in your dictionary.
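
A round-trip sketch of that, with a toy sample set standing in for real training data (the zlib calls are standard; the dictionary quality depends entirely on your samples):

import zlib

samples = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234"]

# zlib uses at most the last 32 KiB of the dictionary, and matches near
# its end are cheapest to encode, so put the most common strings last
zdict = b" ".join(samples)[-32768:]

def compress(s):
    co = zlib.compressobj(wbits=-15, zdict=zdict)  # raw deflate, no header
    return co.compress(s) + co.flush()

def decompress(c):
    do = zlib.decompressobj(wbits=-15, zdict=zdict)
    return do.decompress(c) + do.flush()

for s in samples:
    assert decompress(compress(s)) == s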

You can also try zstd which should perform better than zlib, and which also supports dictionaries. zstd also has code to help you generate dictionaries. You would need to write your own Python interface to zstd.

I have not tried this, but it may be possible to use zstd's dictionary generation to make a good dictionary for zlib's deflate.
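
For illustration, the third-party zstandard package (pip install zstandard) exposes zstd's dictionary training; a sketch, noting that training wants many varied samples and may fail on toy inputs like these:

import zstandard

samples = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234"]

# train a small dictionary; real use needs thousands of varied samples
d = zstandard.train_dictionary(16384, samples * 1000)

cctx = zstandard.ZstdCompressor(dict_data=d)
dctx = zstandard.ZstdDecompressor(dict_data=d)

for s in samples:
    c = cctx.compress(s)
    assert dctx.decompress(c) == s
    print(len(c), len(s))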

Lastly, I would try to preprocess my short strings based on what I know about them. If there is a way to tokenize the contents of the strings that you know will be there, then you will have already compressed them some.
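
A toy illustration of that preprocessing idea (the token table is hypothetical; derive one from your own data, and make sure the reserved bytes never occur in the raw strings):

# replace substrings you know will occur with reserved single bytes
TOKENS = {b"hello": b"\x01", b"world": b"\x02", b"foo": b"\x03", b"bar": b"\x04"}

def pretokenize(s):
    for word, code in TOKENS.items():
        s = s.replace(word, code)
    return s

print(pretokenize(b"hello world foo bar"))  # b'\x01 \x02 \x03 \x04'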

Answer by Basj

Example code, as an addition to @MarkAdler's solution:

import zlib

# toy dictionary; in practice, build it from representative samples of your data
zdict = b"abcdefghijklmnopqrstuvwxyzhelloworldfoobar"
inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
for s in inputs:
    # a fresh compressobj per string: flush() terminates the raw deflate stream
    co = zlib.compressobj(wbits=-15, zdict=zdict)
    c = co.compress(s) + co.flush()
    print(c, len(c), len(c) / len(s))

Here the output is always smaller than the input:

b'\x03\xf3\x15\xc0\x02\x00' 6 0.5454545454545454
b'\x03\x92\n@\n\x00' 6 0.8571428571428571
b'\xf3p\xf5\xf1\xf1W\x00\xb2\x15\x80\x1c\x05\xb0\x04\x00' 15 0.7894736842105263
b'\x03"\x05 K\xc1\xd0\xc8\xd8\x04\x00' 11 0.9166666666666666
b'34261U\x002\x80\\\x00' 11 0.9166666666666666