## Mathematical Inconsistency in Solomonoff Induction?

I posted this on Less Wrong 10 days ago. At the end, I summarize the answer they gave.

What counts as a hypothesis for Solomonoff induction? The general impression I’ve gotten in various places is “a hypothesis can be anything (that you could write down)”. But I don’t think that’s quite it. E.g. evidence can be written down but is treated separately. I think a hypothesis is more like a computer program that outputs predictions about what evidence will or will not be observed.

If X and Y are hypotheses, then is “X and Y” a hypothesis? “not X”? “X or Y?” If not, why not, and where can I read a clear explanation of the rules and exclusions for Solomonoff hypotheses?

If using logic operators with hypotheses does yield other hypotheses, then I’m curious about a potential problem. When hypotheses are related, we can consider what their probabilities should be in more than one way. The results should always be consistent.

For example, suppose you have no evidence yet. And suppose X and Y are independent. Then you can calculate P(X or Y) in terms of P(X) and P(Y). You can also calculate the probability of all three from their lengths (that's the Solomonoff prior). These should always match, but I don't think they do.

The non-normalized probability of X is 1/2^len(X).

So you get:

P(X or Y) = 1/2^len(X) + 1/2^len(Y) - 1/2^(len(X)+len(Y))

and we also know:

P(X or Y) = 1/2^len(X or Y)

since the left hand sides are the same, that means the right hand sides should be equal, by simple substitution:

1/2^len(X or Y) = 1/2^len(X) + 1/2^len(Y) - 1/2^(len(X)+len(Y))

Which has to hold for any X and Y.
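To see the mismatch concretely, here's a minimal Python sketch. The lengths are arbitrary sample values I picked for illustration, and I've given the "or" connective a hypothetical small cost of 2 bits:

```python
# Unnormalized Solomonoff prior: P(H) = 2**-len(H)
def prior(length):
    return 2.0 ** -length

len_x, len_y = 10, 10              # sample hypothesis lengths (arbitrary)
len_x_or_y = len_x + len_y + 2     # hypothetical small cost for "or"

# Probability-theory side (independence): P(X or Y) = P(X) + P(Y) - P(X)P(Y)
lhs_prob = prior(len_x) + prior(len_y) - prior(len_x) * prior(len_y)

# Prior-by-length side: P(X or Y) = 2**-len(X or Y)
rhs_prior = prior(len_x_or_y)

print(lhs_prob)   # ~0.00195
print(rhs_prior)  # ~0.00000024
```

The two sides differ by roughly four orders of magnitude here, and no choice of lengths makes them agree in general.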

We can select X and Y to be the same length and to minimize compression gains when they’re both present, so len(X or Y) should be approximately 2len(X). I’m assuming a basis, or choice of X and Y, such that “or” is very cheap relative to X and Y, hence I approximated it to zero. Then we have:

1/2^2len(X) = 1/2^len(X) + 1/2^len(X) - 1/2^2len(X)

which simplifies to:

1/2^2len(X) = 1/2^len(X)

Which is false (since len(X) isn’t 0). And using a different approximation of len(X or Y) like 1.5len(X), 2.5len(X) or even len(X) wouldn’t make the math work.
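A quick numeric sketch of that last point: writing len(X or Y) ≈ c·len(X) for equal-length X and Y, we can solve for the c the equation would require at each length. The value of c varies with len(X), so no single constant multiplier (1, 1.5, 2, 2.5, ...) can make the equation hold in general:

```python
import math

# For equal-length X and Y, the consistency equation becomes:
#   2**-(c*L) = 2*2**-L - 2**-(2*L)
# Solve for the c that would be required at each length L.
def required_c(L):
    rhs = 2 * 2**-L - 2**-(2 * L)
    return -math.log2(rhs) / L

for L in (5, 10, 20, 40):
    print(L, required_c(L))  # c changes as L changes
```

Since the required c depends on L, no fixed approximation of len(X or Y) as a multiple of len(X) satisfies the equation for all lengths.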

So Solomonoff induction is inconsistent. So I assume there’s something I don’t know. What? (My best guess so far, mentioned above, is limits on what is a hypothesis.)

Also, here’s a quick intuitive explanation to help explain what’s going on with the math: X is both shorter than “X or Y” and logically less probable than it. Think about what you’re doing when you craft a hypothesis. You can add bits (length) to a hypothesis to exclude stuff. In that case, more bits (more length) means lower prior probability, and that makes sense, because the hypothesis is compatible with fewer things from the set of all logically possible things. But you can also add bits (length) to a hypothesis to add alternatives. It could be this or that or a third thing. That makes hypotheses longer but more likely rather than less likely.

Speaking more generally, the Solomonoff prior probabilities are assigned according to length with no regard for consistency amongst themselves, so it’s unsurprising that they’re inconsistent unless the hypotheses are limited in such a way that they have no significant relationships with each other that would have to be consistent. That sounds hard to achieve, and I haven’t seen any rules specified for achieving it. (Note that there are other ways to find relationships between hypotheses besides the one I used above, e.g. looking for subsets.)

Less Wrong's answer, in my understanding, is that in Solomonoff Induction a "hypothesis" must make positive predictions like "X will happen". Probabilistic positive predictions – assigning probabilities to different specific outcomes – can also work. Saying X or Y will happen is not a valid hypothesis, nor is saying X won't happen.

This is a very standard trick by so-called scholars. They take a regular English word (here "hypothesis") and define it as a technical term with a drastically different meaning. This isn't clearly explained anywhere and lots of people are misled. It's also done with e.g. "heritability".

Solomonoff Induction is just sequence prediction. Take a data sequence as input, then predict the next thing in the sequence via some algorithm. (And do it with all the algorithms and see which do better and are shorter.) It's aspiring to be the oracle in The Fabric of Reality but worse.
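A toy sketch of that sequence-prediction picture (real SI enumerates all programs and is uncomputable; the "programs", their lengths, and the data here are all invented for illustration):

```python
# Toy sketch of SI-style sequence prediction. Each "program" predicts
# the next bit of a sequence; its prior weight is 2**-length, where the
# lengths below are made up.
data = [0, 1, 0, 1, 0, 1]  # observed sequence (hypothetical)

# (name, length-in-bits, next-bit predictor) -- all invented
programs = [
    ("always-0",  3, lambda seq: 0),
    ("always-1",  3, lambda seq: 1),
    ("alternate", 5, lambda seq: 1 - seq[-1]),
]

# A program survives if it correctly predicted every bit of the data
# so far, given the preceding bits.
def predicts_data(pred, seq):
    return all(pred(seq[:i]) == seq[i] for i in range(1, len(seq)))

# Weight each surviving program's next-bit prediction by its prior.
weights = {0: 0.0, 1: 0.0}
for name, length, pred in programs:
    if predicts_data(pred, data):
        weights[pred(data)] += 2.0 ** -length

print(weights)  # only "alternate" survives; it predicts 0 next
```

Only the program that matches the data contributes weight, and among survivors the shortest would dominate.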

Elliot Temple on September 4, 2020

## Messages (5)

#### Why would code/English or low-abstraction/high-abstraction simplicity or brevity correspond?

https://www.lesswrong.com/posts/9AWoAAA59hN9PEwT7/why-would-code-english-or-low-abstraction-high-abstraction

Solomonoff Induction (SI) focuses on short code. What’s short in English is not necessarily short in code, and vice versa. Most of our intuition in favor of short, simple explanations is from our experience with English, not code. Is there literature arguing that code and English brevity usually or always correspond to each other? If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.

Another way to think of the issue may be that coding is a low level or reductionist way to deal with an issue, while English is a high level approach that uses high level tools like explanations. Ideas can be represented in many ways, including at different levels of abstraction. It’s unclear that length or simplicity is consistent for the same idea across different levels of abstraction. That is, if you have two ideas X and Y, and X is simpler and shorter when you compare both ideas at a one level of abstraction, it may be unsafe to assume that X will also be simpler and shorter than Y when you compare them at a different level of abstraction. Is there any literature which addresses this?

curi at 12:47 PM on September 4, 2020 | #17815 | reply | quote

SI is just a breadth first search of all possible things that could arise via evolution. It's like an evolutionary algorithm but everything gets to breed and there are no environmental constraints. there's just this one end goal heuristic (the data). if someone introduces a pruning algorithm then we're just implementing parts of evolution. (this is why it resembles CR and yes/no with a dumb prior, the next improvement is to remove the prior!)

one big difference is that evolution looks for *cheap* solutions instead of short ones. it's easier to have a big library of genes but only "need" 10% of them than it is to *only* have 10% of them and limit yourself to that. there are buffers built into mutations/stability via the *regulation* of gene activation.

the reason that cheap solutions are better (evolutionarily) is that they have the buffer to develop reach via abstraction - because they're not penalised for length. the genes which are meaningful to mutate will be inputs to modular components. this allows things to invert quickly and allows for composability.

seen like this, SI finds solutions which are a) not modular, b) not necessarily *cheap*, and c) aren't quick to find.

SI is just a shitty, uncomputable evolutionary algorithm.

Anonymous at 10:09 AM on September 5, 2020 | #17857 | reply | quote

> all possible things that could arise via evolution

SI chooses programs from an alphabet (that's their term), like evolution chooses genomes from an alphabet of genes.

Anonymous at 10:10 AM on September 5, 2020 | #17858 | reply | quote

#17857 I connect the idea of buffers and reach in this post; just want to mention curi came up with it - https://curi.us/2352-chains-bottlenecks-and-optimization

I think it made a lot of sense to include

Anonymous at 10:14 AM on September 5, 2020 | #17859 | reply | quote

Another way to look at SI:

- communication happens

- run programs which generate communication of whatever

- when you can read what the communication is and it makes sense, you've found the answer.

Alternatively, lo-fi SI:

- start with an infinite number of monkeys each with their own typewriter.

- let them type stuff.