Bitstring Compressibility

a Charles H Bennett paper mentioned (search for "simple counting argument". it's on page 3) that there's a counting argument that most strings can't be compressed, or can only be compressed a little bit. Something like that (from memory). He didn't explain it.

So I thought about what it is.

Consider a bitstring of length 50. What does compressing it mean? It means you have some shorter bitstring which maps to it.

Suppose you were going to compress it to a length 40 bitstring. Could you compress each length 50 bitstring to a length 40 bitstring to save 10 bits of space? No because there are 1024 times more length 50 bitstrings than length 40 bitstrings. So each length 40 bitstring has to, on average, be mapped to 1024 length 50 bitstrings. So by removing 10 bits you're also losing 10 bits of information (you're specifying the target with 1/1024 ambiguity). Given a 40 bit string, you don't know WHICH 50 bit string it's shorthand for b/c there are 1024 it could be shorthand for, if you use a method to compress ALL the 50 bit strings uniformly (for example, by simply cutting off the last 10 bits. or you could cut off the first 10 bits or some other selection of 10 bits that don't even have to be in a row).

so at most we could compress 1/1024th of all 50-bit strings to 40-bit strings. that's a theoretical limit otherwise you have multiple 50-bit strings that you're trying to compress to the same 40-bit string.

you can take a 40-bit string, compress 2 50-bit strings to it, and then add an extra bit saying which one of the multiple mapped strings you mean. like dealing with hash table collisions with a linked list. but that's stupid in this case. it makes more sense to just compress to a 41-bit string. (maybe in some more complex approach it's useful? but it won't fundamentally change the conclusions.)

so what if we were to set up a compression algorithm which uses ALL smaller strings to compress to? we can compress to a 1 bit string, a 2 bit string, etc, all the way up to 49 bits.

well there's half as many 49 bit strings as 50 bit, and there's a quarter as many 48bit strings as 50bit, and an eighth as many 47 bit strings as 50bit, etc. so 1/2 + 1/4 + 1/8 ... = 1. so ALL the prior bitstrings have exactly the right number of possibilities. roughly. but not quite:

a 2-bit string has 4 possibilities. a 1-bit string has 2 possibilities. not enough. even if we count 1 more for a 0-bit string we're still 1 short.

so at least 1 (2 if we can't use a 0bit string) of the 50bit strings can't be compressed.

also, how many can be compressed a lot? well half of them are only getting compressed 1 bit. 1/4 are getting compressed 2 bits. so if we define "a lot" as more than N bits, then it's only 2^-N which get compressed a lot. like if you want 5+ bits of compression, then it's only 1 out of 32 bitstrings that can be compressed that much.

so significant compression is sparse/rare. this is important b/c the field defines random by the content (uncompressible) rather than by the source (how you generated the data). which is interesting b/c it's a "content not source" thing like Critical Rationalism cares about, and it's also not how common sense thinks of random data. common sense says if you flip a coin a bunch, whatever you get is random. but computation theory stuff says if you get 50 tails and that's it, that sequence isn't random. but MOST sequences of randomly generated coinflips will be random (mostly uncompressible) sequences because that's how most sequences are as i was just proving (e.g. at best only 1 in 32 sequences is compressible by 5+ bits).

there's another issue i think is interesting. what if you make 1024 different compression algorithms? each one could compress a different 1/1024th of all the 50bit strings to 40bit strings. however, the overhead for specifying which of the 1024 compression algorithms a particular 40-bit number was encoded with would be 10 bits b/c you need to specify a number from 1-1024 to say which compression algorithm it was. so you don't gain anything. this kinda of overhead is an important issue which gets in the way of attempts to cheat.

Information/Links

Recent Posts

List All Posts

Bitstring Compressibility

Messages