I'd like to compress a DB containing a text column (< 20 char) + an integer ID, to minimize disk space and still be able to perform a fast search thanks to an index. There are billions of rows.
Here below is an attempt with a corpus of text randomly generated with the letters abcdefghijk + a few patterns foo, foo1234, ...
These recurrent patterns show it would be possible to compress the data. (In real data, these patterns are by hundreds of thousands).
The following example with 100k strings shows that:
- these strings take ~11 bytes on average,
- once in the DB (with index), it takes ~40 bytes on average (factor x4)
- once in the DB (without index), it takes ~20 bytes on average
Which options are available to compress the data?
Not possible:
- http://www.sqlite.org/zipvfs because it's quite expensive
- https://github.com/salviati/sqlite3-lz4 doesn't seem easily available on Python 3 + Windows
- https://github.com/mlin/sqlite_zstd_vfs and https://github.com/phiresky/sqlite-zstd not easily available on Windows
Which other options are there?
Can we INSERT lz4.compress(s) instead of s in the DB?
Then how to keep the index on s to be able to do fast queries (and not on the compressed data)?
N = 100_000
import sqlite3, random, time, os
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT);")
db.execute("CREATE UNIQUE INDEX idx ON data(a);")
total_size = 0
for i in range(N):
s = "".join(random.choice(["foo", "foo1234", "bar", "hello", "whatsup"] + list("abcdefghijk")) for _ in range(5))
total_size += len(s)
db.execute("INSERT OR IGNORE INTO data (a) VALUES (?)", (s,))
db.commit()
print(f"{N} strings, raw data: ~ {total_size / N} bytes per string")
print(f"DB on disk: ~ {os.path.getsize('test.db') / N} bytes per string")
t0 = time.time()
L = []
L = list(db.execute("SELECT * FROM data WHERE a >= 'd' AND a < 'e';"))
print(f"query returned {len(L)} results in {time.time() - t0} sec")
Output:
100000 strings, raw data: ~ 11.25016 bytes per string
DB on disk: ~ 40.63232 bytes per string
query returned 6004 results in 0.008000612258911133 sec