Compress a database but keep fast prefix search with an index


I'd like to compress a DB containing a short text column (< 20 chars) plus an integer ID, to minimize disk space while still allowing fast prefix searches via an index. There are billions of rows.

Below is an attempt with a corpus of text randomly generated from the letters abcdefghijk plus a few recurring patterns (foo, foo1234, ...). These recurring patterns show that the data should be compressible. (In the real data, such patterns occur hundreds of thousands of times.)

The following example with 100k strings shows that:

  • the raw strings take ~11 bytes each on average,
  • stored in the DB with the index, they take ~40 bytes each (a factor of ~4; see the breakdown sketch just below),
  • stored in the DB without the index, they take ~20 bytes each.
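To see where those ~40 bytes per string go, the dbstat virtual table can break the file down per B-tree. A sketch only: it assumes a SQLite build compiled with SQLITE_ENABLE_DBSTAT_VTAB, which not every Python distribution enables, and it reads the test.db file produced by the script further below.

import sqlite3
db = sqlite3.connect('test.db')  # the file produced by the script below
N = 100_000
# total page size per B-tree: the 'data' table vs. the 'idx' index
for name, size in db.execute("SELECT name, SUM(pgsize) FROM dbstat GROUP BY name"):
    print(f"{name}: ~ {size / N} bytes per string")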

Which options are available to compress the data?

Not possible:

Which other options are there?
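One option I can think of (a sketch, not benchmarked at this scale): a WITHOUT ROWID table keyed directly on the text. The rows then live in a single clustered B-tree, so we don't pay for both a table and a separate index; note that AUTOINCREMENT is not allowed in WITHOUT ROWID tables, so the id must be assigned explicitly.

import sqlite3
db = sqlite3.connect('test_worowid.db')
# single B-tree keyed on the text itself: the primary key doubles
# as the prefix index, so there is no second B-tree to pay for
db.execute("""CREATE TABLE data(
    a  TEXT PRIMARY KEY,
    id INTEGER  -- AUTOINCREMENT is not allowed in WITHOUT ROWID tables
) WITHOUT ROWID;""")
# prefix range queries go straight to the primary-key B-tree:
# SELECT * FROM data WHERE a >= 'd' AND a < 'e';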

Can we INSERT lz4.compress(s) instead of s into the DB? If so, how do we keep the index on s (and not on the compressed data) so that fast queries remain possible?
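For example, something along these lines might work (a sketch only; the 4-character prefix column, the helper names, and the '\uffff' range sentinel are my own assumptions): store a short uncompressed prefix in an indexed column next to the compressed payload, let the index narrow the candidates, then decompress and filter exactly.

import sqlite3
import lz4.frame  # pip install lz4

db = sqlite3.connect('test_lz4.db')
db.execute("CREATE TABLE data(id INTEGER PRIMARY KEY, prefix TEXT, blob BLOB);")
db.execute("CREATE INDEX idx_prefix ON data(prefix);")

def insert_compressed(s):
    # keep a short uncompressed prefix for the index, compress the full string
    db.execute("INSERT INTO data (prefix, blob) VALUES (?, ?)",
               (s[:4], lz4.frame.compress(s.encode())))

def prefix_search(p):
    # range scan on the indexed prefix column ('\uffff' sentinel is fine
    # for the ASCII corpus above), then exact filtering after decompression
    k = p[:4]
    rows = db.execute("SELECT id, blob FROM data WHERE prefix >= ? AND prefix < ?",
                      (k, k + '\uffff'))
    return [(i, s) for i, b in rows
            if (s := lz4.frame.decompress(b).decode()).startswith(p)]

One caveat I'm aware of: lz4's frame format adds roughly 10-20 bytes of header/footer per compressed value, so compressing each < 20-char string individually may not save anything; compressing batches of rows would amortize that, at the cost of more complex lookups.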

import sqlite3, random, time, os

N = 100_000
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT);")
db.execute("CREATE UNIQUE INDEX idx ON data(a);")

# build N random strings from single letters plus a few recurring patterns
total_size = 0
for i in range(N):
    s = "".join(random.choice(["foo", "foo1234", "bar", "hello", "whatsup"] + list("abcdefghijk")) for _ in range(5))
    total_size += len(s)
    db.execute("INSERT OR IGNORE INTO data (a) VALUES (?)", (s,))
db.commit()

print(f"{N} strings, raw data: ~ {total_size / N} bytes per string")
print(f"DB on disk: ~ {os.path.getsize('test.db') / N} bytes per string")

# prefix search: all strings starting with 'd', as a range scan on the index
t0 = time.time()
L = list(db.execute("SELECT * FROM data WHERE a >= 'd' AND a < 'e';"))
print(f"query returned {len(L)} results in {time.time() - t0} sec")

Output:

100000 strings, raw data: ~ 11.25016 bytes per string
DB on disk: ~ 40.63232 bytes per string
query returned 6004 results in 0.008000612258911133 sec