Compression of ordered integer list

75 Views Asked by At

I'm reading the book "Introduction to information retrieval" and I have some practical doubt. In the book there is a chapter dedicated to the compression of the index (dictionary and posting list). For the posting list an explanation of the document list is explain.

Example: The term "book" has a posting list for example "1 4 7 19 20 25" which means that it is contained in this documents ids.

In this case the gap between the ids is computed "1 3 3 12 1 5" so then with a linear scan we can retrive each document by summing the previous.

Then is explained that the gap between two DocID will be small in most of the case and it's possible to use an seconder like (Variable byte codes or gamma encoding).

With gamma encoding we obtain:

  • 1 -> 0
  • 3 -> 101
  • 3 -> 101
  • 12 -> (binary 1100) -> (length(100) = 3) -> (unary(3) 1110) -> unary + length = 1110100 which is the gamma for 12
  • 1 -> 0
  • 5 -> 110101

So the result of my compression with gamme encoding for "1 3 3 12 1 5" is "010110111101000110101"

My doubt is how to get in practice this compression? I'm not understanding practically how to store this data to reduce the space complexity. Because now I'm using pickle to store the dictionary that I load from a txt file without the benefit of having a list of ordered integer.

def pickle_data(path: str, posting_list):
    out = open(path, 'wb')
    pickle.dump(posting_list, out)
    out.close()

Someone can explain how use this benefit of an ordered integer with (at most) small integer?

1

There are 1 best solutions below

4
Mark Adler On

Your 010110111101000110101 is a sequence of bits. You have not shown any code for how str is generated or what it is. If you are storing it as a string of 0 and 1 characters, then you have unnecessarily expanded the length of the binary sequence by a factor of eight, defeating the compression you are seeking.

You need to pack the bits into a series of bytes, eight bits per byte, in order to realize the compression. You will also need a way to terminate the sequence, i.e., know when it ends. Without that, you could end up with extraneous output from the last byte that was only partially filled with bits.