Compress a database but keep fast prefix search with index

142 Views Asked by Basj At 25 October 2022 at 19:50

I'd like to compress a DB containing a text column (< 20 char) + an integer ID, to minimize disk space and still be able to perform a fast search thanks to an index. There are billions of rows.

Here below is an attempt with a corpus of text randomly generated with the letters abcdefghijk + a few patterns foo, foo1234, ... These recurrent patterns show it would be possible to compress the data. (In real data, these patterns are by hundreds of thousands).

The following example with 100k strings shows that:

these strings take ~11 bytes on average,
once in the DB (with index), it takes ~40 bytes on average (factor x4)
once in the DB (without index), it takes ~20 bytes on average

Which options are available to compress the data?

Not possible:

http://www.sqlite.org/zipvfs because it's quite expensive
https://github.com/salviati/sqlite3-lz4 doesn't seem easily available on Python 3 + Windows
https://github.com/mlin/sqlite_zstd_vfs and https://github.com/phiresky/sqlite-zstd not easily available on Windows

Which other options are there?

Can we INSERT lz4.compress(s) instead of s in the DB? Then how to keep the index on s to be able to do fast queries (and not on the compressed data)?

N = 100_000
import sqlite3, random, time, os
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT);")
db.execute("CREATE UNIQUE INDEX idx ON data(a);")
total_size = 0
for i in range(N):
    s = "".join(random.choice(["foo", "foo1234", "bar", "hello", "whatsup"] + list("abcdefghijk")) for _ in range(5))
    total_size += len(s)
    db.execute("INSERT OR IGNORE INTO data (a) VALUES (?)", (s,))
db.commit()
print(f"{N} strings, raw data: ~ {total_size / N} bytes per string")
print(f"DB on disk: ~ {os.path.getsize('test.db') / N} bytes per string")
t0 = time.time()
L = []
L = list(db.execute("SELECT * FROM data WHERE a >= 'd' AND a < 'e';"))
print(f"query returned {len(L)} results in {time.time() - t0} sec")

Output:

100000 strings, raw data: ~ 11.25016 bytes per string
DB on disk: ~ 40.63232 bytes per string
query returned 6004 results in 0.008000612258911133 sec

Original Q&A

Compress a database but keep fast prefix search with index

There are 0 best solutions below

Related Questions in PYTHON

Related Questions in DATABASE

Related Questions in SQLITE

Related Questions in COMPRESSION

Related Questions in LZ4

Trending Questions

Popular # Hahtags

Popular Questions