Algorithm: Minimal Encoding, Error Correction, Please Help?


Say there's an array of 1024 bits that are all zeros:

example: [0,0,0,0,0,0,0,...]

Then I overwrite 20 zeros with ones at completely random positions:

example: [0,1,0,0,0,0,0,...]

What is the theoretical minimum number of bits needed to encode the location of these 20 randomly placed bits, assuming that I had a perfect encoder?

I know there are communication theory equations that will tell me this, but I want to double check my calculations.

Harder bonus question: Show me the code for an algorithm that implements an encoding that approaches this minimum limit.

Bonus bonus: What if the bit flips were byte-level instead of bit-level? e.g. entire bytes flipped. Same result?


3 Answers

Answered by ErikE

If you used a string of 200 bits, treated as an array of twenty 10-bit numbers each listing the position of one of the one-bits, you'd be saving 824 bits over the raw 1024.

But I don't think this is the minimum. For example, if you treat each of the numbers as relative to the previous item instead of an absolute position, some analysis may show that on average you'd only need, say, 8 bits to encode the distance to the next one-bit. So add a bit to the front: when 0, then 200 bits follow with absolute positions. When 1, then 160 bits follow with relative positions. This should yield a lower average number of bits to encode the full value.
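
To make that concrete, here's a rough sketch of the two-mode scheme in Python (the function name and exact bit layout are my own choices, not a reference implementation):

    def encode_positions(positions):
        # positions: indices of the 20 one-bits in the 1024-bit array.
        positions = sorted(positions)
        # Distance from the previous one-bit (the first "delta" is the first position itself).
        deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
        if all(d < 256 for d in deltas):
            # Flag bit '1', then twenty 8-bit relative distances: 1 + 160 bits.
            return '1' + ''.join(format(d, '08b') for d in deltas)
        # Flag bit '0', then twenty 10-bit absolute positions: 1 + 200 bits.
        return '0' + ''.join(format(p, '010b') for p in positions)

Decoding just branches on the leading flag bit and reverses whichever layout was used.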

Generalizing, this is simply data compression. There are probably many compression algorithms that could reduce the average number of bits required to encode your "twenty one-bits in 1024" to a very small number. Computing a suitable binary tree, storing a representation of it, then storing the bits required to traverse the tree would likely yield a very efficient algorithm (this is essentially Huffman coding, one of the building blocks of modern data compression).

Answered by Max Shawabkeh

If you are going with a dictionary-based encoding where the decoder has the dictionary as well, there's no absolute minimum. For a frequency-based encoding, however, what you need to calculate is the entropy:

E = -(P(0) * log_2(P(0)) + P(1) * log_2(P(1)))
E = -(1004/1024 * log_2(1004/1024) + 20/1024 * log_2(20/1024))
E = 0.1388005

So each bit of the input should require 0.1388005 bits of the output on average. In total:

0.1388005 * 1024 = 142.1317 bits.

This means that in theory, using an optimal algorithm you can encode any string with 1004 zeros and 20 ones (or the other way around) using 143 bits.
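
If you want to sanity-check those numbers, a few lines of Python reproduce them:

    from math import log2

    n, ones = 1024, 20
    p1 = ones / n
    p0 = 1 - p1
    entropy = -(p0 * log2(p0) + p1 * log2(p1))
    print(entropy)      # ~0.1388 bits per input bit
    print(entropy * n)  # ~142.13 bits total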

Answered by Darius Bacon

ceiling(log2(1024 choose 20)) = 139 bits

(calculation on Wolfram Alpha)
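
You can reproduce that bound in a couple of lines of Python (math.comb needs 3.8+):

    from math import comb, log2, ceil
    print(ceil(log2(comb(1024, 20))))  # 139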

The other answers saying 143 bits leave out that we know there are exactly 20 ones. Here's a concrete encoding to show one way to use that knowledge: using arithmetic coding, send each of the 1024 '0' or '1' symbols in succession. The first symbol gets weighted at 20/1024 probability of being '1'; but each later symbol gets weighted differently. If the first symbol was '0', use 20/1023 on the next; but if it was '1', use 19/1023. Continue in the same way to the end. Arithmetic coding does all the hard work to fit in about 139 bits, as long as we tell it the right probabilities.
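
If you'd rather not write a full arithmetic coder, here's a sketch of a different technique that also reaches the ceiling(log2(1024 choose 20)) bound exactly: ranking the set of one-bit positions in the combinatorial number system (enumerative coding). The function names are mine, and it assumes Python 3.8+ for math.comb:

    from math import comb

    def rank(positions, k=20):
        # Map a sorted set of k positions to one integer in [0, comb(n, k)).
        positions = sorted(positions)
        return sum(comb(p, i + 1) for i, p in enumerate(positions))

    def unrank(r, k=20):
        # Inverse of rank(): recover the k positions from the integer.
        positions = []
        for i in range(k, 0, -1):
            p = i - 1
            while comb(p + 1, i) <= r:
                p += 1
            r -= comb(p, i)
            positions.append(p)
        return sorted(positions)

The integer returned by rank() is always less than comb(1024, 20), so writing it out as a fixed-width 139-bit field meets the theoretical minimum.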

On the "bonus bonus": error correction wasn't in the original question. You can layer an error-correcting code on top of first finding an optimal encoding assuming no error, like above (and that's normally a good way to break down the problem). You don't lose any coding efficiency that way, though I think you can lose robustness -- as in, if you get more errors than your ECC can correct, will the message come out as total garbage or will it degrade more gracefully?