Questions about the frequency table in arithmetic encoding compression

56 Views Asked by At

I have a high level idea of how arithmetic coding works, but in the sources I've read (https://neptune.ai/blog/lossless-data-compression-using-arithmetic-encoding-in-python-and-its-applications-in-deep-learning and https://go-compression.github.io/algorithms/arithmetic/) it's not clear to me what the relationship between the frequency table and the message we're compressing is.

My understanding is the frequency table is determined from some universe of messages, and thus should be representative of messages you'll encode, in expectation. Is this understanding correct?

My second question pertains to why characters with larger frequencies (and thus larger probabilities) take up a large section of the interval. I don't believe this has anything to do with accuracy. Does it somehow speed up the algorithm? Why not make every character have the same probability?

1

There are 1 best solutions below

5
user22405329 On

I start with the second question: no, it has nothing to do with speed. Using the probabilities is the very thing that actually does the compression!

The main idea of the compression: use few bits for common messages, and use many bits for rare messages. In the ideal case, we should use exactly 1 bit for a message with 50% probability. So compression actually crucially depends on knowing the frequencies.

Arithmetic coding implements almost ideal coding - it produces a bit for each time a probability range is reduced by half. Low probability symbols produce a smaller range - so they output more bits. High probability symbols produce less bits. Symbols with probability >50% may produce no bits - it would take several such symbols for a single bit!

There are many ways to calculate a frequency table - but it should match the real distribution of the input as close as possible. The closer they are - the better compression you would get. Note, that both encoder and decoder must use the same frequencies!

  • You could just pre-calculate the expected frequencies and hardcode them into both encoder and decoder. However, beware - if the input doesn't follow expectations - you'll get no compression, or even get an "expansion"!
  • You could calculate frequencies from the input, and pass the values to the decoder. Of course, storing the table will increase the compressed size.
  • You could calculate the frequencies from the previous input, and live-update them as you compress. Arithmetic coding allows changing frequencies at any time.