# Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement

Twitter user @atroyn claims that recursive self-improvement is impossible because of Kolmogorov complexity. Quoting most of^{[1]} the argument here:

here is an argument against the possibility of recursive self improvement of any “intelligent” computer program, based on kolmogorov complexity.

intelligence is the ability to make correct predictions about the state of the world given available information.

each program which makes predictions about the world has a kolmogorov complexity corresponding to the length of the shortest string which can express that program

for a given program p call this complexity k.

(unfortunately k(p) is in general uncomputable, the proof reduces to the halting problem, but that’s not important here)

more intelligence (in our definition) implies the ability to predict more of the world more accurately, i.e. to express more of the world’s complexity - this implies that a more intelligent program p2 necessarily has more complexity than a less intelligent p1

to see that this is necessarily so, note that if we could predict the world equally accurately as p1’s prediction with a program p0 with k0 < k1, then we have a contradiction since k1 was supposed to be the minimal expression of intelligence at that level

in order to get recursive self improvement, you need a program p1 which is capable of emitting p2 which is better able to predict the world than p1 - i.e., we need p1 to emit p2 such that k2 > k1

but this is a contradiction.

[...]

The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity. Originally, Kolmogorov complexity measured the complexity of bit strings. But we’re talking about predictors here, things that observe the world and spit out probability distributions over observed outcomes. In the context of predictors, Kolmogorov complexity measures the complexity of a function from observations to predictions.

In the case of ideal Bayesian reasoning, we can nail down such a function just by specifying a prior, e.g. the Solomonoff prior. (Plus an approximation scheme to keep things computable, I guess.) This doesn’t take a very large program to implement. But a non-ideal reasoner will screw up in many cases, and there’s information contained in the exact way it screws up for each set of observations. Such reasoners can have an almost arbitrarily high Kolmogorov complexity, and they’re all worse than the ideal Bayesian program.

In other words, the successor program has Kolmogorov complexity less than or equal to that of its predecessor, but so what? That doesn’t imply that it’s worse.
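To make this concrete, here is a minimal sketch rather than a proof. It assumes a toy world of biased coin flips, uses Laplace’s rule of succession as a stand-in for the ideal Bayesian predictor (it is Bayes-optimal only for a uniform prior over the coin’s bias, and is not Solomonoff induction), and uses zlib-compressed source length as a crude, computable upper-bound proxy for Kolmogorov complexity. All names and parameters below are made up for illustration.

```python
import math
import random
import zlib


def compressed_size(source: str) -> int:
    """Crude stand-in for Kolmogorov complexity: zlib-compressed length in bytes.
    This is only an upper-bound proxy; true K is uncomputable."""
    return len(zlib.compress(source.encode("utf-8"), 9))


# Short predictor: Laplace's rule of succession for P(next bit = 1).
SHORT_SOURCE = """
def predict(history):
    return (sum(history) + 1) / (len(history) + 2)
"""


def make_patched_source(n_patches: int, seed: int = 0) -> str:
    """Same predictor, plus a table of arbitrary overrides for the first
    n_patches time steps. The table encodes exactly how this reasoner screws up."""
    rng = random.Random(seed)
    table = {t: round(rng.uniform(0.01, 0.99), 6) for t in range(n_patches)}
    return (
        f"OVERRIDES = {table!r}\n"
        "def predict(history):\n"
        "    t = len(history)\n"
        "    if t in OVERRIDES:\n"
        "        return OVERRIDES[t]\n"
        "    return (sum(history) + 1) / (len(history) + 2)\n"
    )


def load_predictor(source: str):
    """Turn predictor source code into a callable."""
    namespace = {}
    exec(source, namespace)
    return namespace["predict"]


def log_loss(predict, bits) -> float:
    """Average negative log-likelihood (in nats) of the observed bits."""
    total = 0.0
    for t, bit in enumerate(bits):
        p_one = predict(bits[:t])
        total -= math.log(p_one if bit == 1 else 1.0 - p_one)
    return total / len(bits)


def compare(n_bits=2000, n_patches=500, p_true=0.8, seed=1):
    """Return {name: (complexity proxy, average log loss)} for both predictors."""
    rng = random.Random(seed)
    bits = [1 if rng.random() < p_true else 0 for _ in range(n_bits)]
    sources = {"short": SHORT_SOURCE, "patched": make_patched_source(n_patches)}
    return {
        name: (compressed_size(src), log_loss(load_predictor(src), bits))
        for name, src in sources.items()
    }
```

Calling `compare()` gives a complexity proxy and an average log loss for each predictor. The patched predictor is the longer program and also the worse predictor, so in this toy setting “more complex” and “more intelligent” come apart.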

(Also, Kolmogorov complexity doesn’t care about how much time a program takes to run at all, but in the real world it’s an important consideration, and a target for self-improvement.)

That concludes this post: without the assumption that higher Kolmogorov complexity is better, the whole argument falls apart.

[1] The rest of the thread briefly touches on the issue of *how an AI could know that its successor would necessarily be an improvement*. The discussion there is kind of doomed since it’s done with the goal of showing that the successor has lower or equal Kolmogorov complexity than the original, which is uninteresting, though we can see right away that it *must* be true, assuming that the original writes the successor before observing the world at all. But there’s an interesting version of the question, which asks about the set of axioms used by the systems to reason about the world, rather than the Kolmogorov complexity. See this paper by Yudkowsky and Herreshoff for details.

For an argument like this, the author needs to immediately show that it doesn’t “prove too much”. I.e., that it doesn’t also imply that evolution is impossible, or that a child learning with a growing brain is impossible...

I had a kinda different take (copied from twitter)

Then after sleeping on it I tweeted again:

I think the OP here is also valid (and complementary).

Each of these points looks valid, but there’s a much simpler refutation: « Any good enough intelligence is smart enough to distribute part of its cognition to external devices. »

Application: either my code includes Wikipedia and whoever might change Wikipedia just before I consult it, or its Kolmogorov complexity does not fully capture my capabilities. In a sense, this shows the cost of putting too much confidence in a debatable picture of our capabilities and limitations as a single agent working from some cockpit.

(Reaction to the first sentence: “Is this going to be an argument that would imply that humans can’t improve their own intelligence?”)

Yeah, his first wrong statement in the argument is “a more intelligent program p2 necessarily has more complexity than a less intelligent p1”. I would use an example along the lines of “p1 has a hundred data points about the path of a ball thrown over the surface of the Moon, and uses linear interpolation; p2 describes that path using a parabola defined by the initial position and velocity of the projectile and the gravitational pull at the surface of the Moon”. Or “rigid projectiles A and B will collide in a vacuum, and the task is to predict their paths; p1 has data down to the atom about projectile A, and no data at all about projectile B; p2 has the mass, position, and velocity of both projectiles”. Or, for that matter, “p1 has several megabytes of *incorrect* data which it incorporates into its predictions”.

It seems he may have confused himself into assuming that p1 is the *most intelligent possible* program of Kolmogorov complexity k1. (He later says “… then we have a contradiction since k1 was supposed to be the minimal expression of intelligence at that level”. Wrong; k1 was supposed to be the minimal expression of *that particular intelligence p1*, not the minimal expression of some set of possible intelligences.) Then it *would* follow that any more intelligent (i.e. better-predicting, by his definition) program must be more complex.

Seems to me that the usage of Kolmogorov complexity in this context is a red herring. Complexity of what: the program *alone*, or the program *and the data it gets*? The former is irrelevant, because the entire idea is that an intelligence at human level or higher can observe the environment and *learn from it*. The latter, assuming that we can make an unlimited number of observations and experiments, is potentially unlimited.

Mathematically speaking, a *universal* program (an interpreter that can simulate an arbitrary program described in its data) *has* a constant Kolmogorov complexity, and yet can *simulate* a program with arbitrarily high Kolmogorov complexity. (The extra complexity is in the data describing the simulated program.)

If we taboo “Kolmogorov complexity”, it seems to me that the argument reduces to: “a machine cannot self-improve, because it can only build the machines it could simulate, in which case what’s the point of actually building them?”. Which, in some sense yes (assuming unlimited computing power and time), but the machine that is actually built can hypothetically run *much* faster than the simulated one.

As has been observed by other commenters, the argument fails to take into account runtime limitations: in the real world programs can self improve by finding (provably) faster programs that (provably) perform the same inferences that they do, which most people would consider self improvement. However the argument may be onto something: it is indeed true that a program p cannot output a program q with K(q) > K(p) by more than a constant (there is a short program which simulates any input program and then runs its output). Here K(p) is the length of the shortest program with the same behavior as p; in this case we seem to require p to both output another program q and learn to predict a sequence. It is also true that a high level of Kolmogorov complexity is required to eventually predict all sequences up to a high level of complexity: https://arxiv.org/pdf/cs/0606070.pdf.

The real world implications of this argument are probably lessened by the fact that predictors are embedded, and can improve their own hardware or even negotiate with other agents for provably superior software.
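The constant-overhead bound on K(q) mentioned above can be spelled out a little. This is only a sketch under assumptions that are not in the thread: a fixed universal machine $U$, self-delimiting descriptions so that a wrapper and a description can be concatenated unambiguously, and a program $p$ that takes no input and outputs $q$. The symbols $d_p$, $w$ and $c$ are introduced here purely for illustration.

```latex
% Assumptions: U is a fixed universal machine, K_U(x) = \min\{ |d| : U(d) = x \},
% d_p is a shortest description of p, and w is a fixed wrapper program meaning
% "decode the program that follows, run it, and print its output".
\begin{align*}
  U(d_p) = p, \quad |d_p| = K_U(p)
    &\;\Longrightarrow\; U(w\,d_p) = q \\
    &\;\Longrightarrow\; K_U(q) \le |w| + |d_p| = K_U(p) + c,
    \qquad c = |w|.
\end{align*}
```

The constant $c$ depends on $U$ and the wrapper, not on $p$ or $q$. The bound says nothing about how well $q$ predicts or how fast it runs, and it no longer applies in this form once $p$ gets to read observations before writing $q$.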

Perfect. A Turing machine doing Levin Search or running all possible Turing machines is the first example that came to my mind when I read Anton’s argument against RSI-without-external-optimization-bits.

Another good example is the Gödel machine.

I think he’s saying “suppose p1 is the shortest program that gets at most loss x. If p2 gets loss y < x, then we must require a longer string than p1 to express p2, and p1 therefore cannot express p2”.

This seems true, but I don’t understand its relevance to recursive self improvement.

I think Anton assumes that we have the simplest program that predicts the world to a given standard, in which case this is not a mistake. He doesn’t explicitly say so, though, so I think we should wait for clarification.

But it’s a strange assumption; I don’t see why the minimum complexity predictor couldn’t carry out what we would interpret as RSI in the process of arriving at its prediction.

The thing about the Pareto frontier of Kolmogorov complexity vs prediction score is that most programs aren’t on it. In particular, it seems unlikely that p_1, the seed AI written by humans, is going to be on the frontier. Even p_2, the successor AI, might not be on it. We can’t equivocate between all programs that get the same prediction score; differences between them will be observable in the way they make predictions.
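As a rough illustration of that frontier, here is a minimal sketch. The candidate names, complexities and losses are entirely made up; real Kolmogorov complexities are uncomputable, so the `complexity` numbers are hypothetical stand-ins.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    complexity: int  # hypothetical description length, e.g. in bits
    loss: float      # prediction loss, lower is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate beats on both axes.

    A candidate is kept only if its loss is strictly lower than that of every
    candidate with equal or smaller complexity. Among exact ties, one
    representative is kept.
    """
    frontier = []
    best_loss = float("inf")
    # Walk from simplest to most complex; ties in complexity are ordered by loss.
    for c in sorted(candidates, key=lambda c: (c.complexity, c.loss)):
        if c.loss < best_loss:
            frontier.append(c)
            best_loss = c.loss
    return frontier


# Hypothetical numbers: p_2 improves on p_1 on both axes, yet neither is on the frontier.
PROGRAMS = [
    Candidate("p_1 (seed AI)", 900, 0.40),
    Candidate("p_2 (successor)", 700, 0.30),
    Candidate("tiny", 100, 0.60),
    Candidate("frontier_mid", 400, 0.30),
    Candidate("frontier_best", 800, 0.10),
]

FRONTIER_NAMES = [c.name for c in pareto_frontier(PROGRAMS)]
```

With these made-up numbers, `FRONTIER_NAMES` contains only `tiny`, `frontier_mid` and `frontier_best`. The step from p_1 to p_2 is a real improvement that happens entirely off the frontier, so an argument that only constrains frontier programs does not obviously rule it out.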

I don’t disagree with any of what you say here; I just read Anton as assuming we have a program on that frontier.