Intuitive Explanation of Solomonoff Induction

Update: Alex Altair has now finished this tutorial!

A while ago I began writing a Solomonoff Induction tutorial, but now it’s one of those Less Wrong articles I probably will never have time to write.

But, maybe a few of you can take what I’ve started, team up, and finish the thing. I’m basically just following along with the Solomonoff induction tutorials from Shane Legg (2008, 2004), but making them simpler and readable by a wider audience. I think this would be a valuable thing for Less Wrong to have, and one that would draw even more traffic to our community, but I don’t have time to write it myself anymore!

Who wants to be a hero and finish this thing? The original markdown source is available here.

This is a tutorial for beginners. Those familiar with Solomonoff induction may prefer to read Rathmanner & Hutter (2011).

People disagree about things. Some say that vaccines cause autism; others say they don’t. Some say everything is physical; others believe in a god. Some say that complicated financial derivatives are essential to a modern competitive economy; others think a nation’s economy will do better without them. It’s hard to know what is true.

And it’s hard to know how to figure out what is true. Some think you should assume the things you are most certain about and then deduce all other beliefs from your original beliefs. Others think you should accept at face value the most intuitive explanations of your personal experience. Others think you should generally agree with the scientific consensus until it is disproven.

Wouldn’t it be nice if finding out what’s true was like baking a cake? What if there was a recipe for finding out what is true? All you’d have to do is follow the written instruction exactly, and after the last instruction you’d inevitably find yourself with some sweet, tasty truth!

As it turns out, there is an exact recipe for finding truth. It was discovered in the 1960s.

The problem is that you don’t have time to follow the recipe. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe, and you can’t do that.

But we can find shortcuts. Suppose you know the exact recipe for baking a cake requires you to count out one molecule of H2O at a time until you have exactly 0.5 cups of water. If you did that, you might not finish before the heat death of the universe. But you could approximate that part of the recipe by measuring out something very close to 0.5 cups of water, and you’d probably still end up with a pretty good cake.

Similarly, once we know the exact recipe for finding truth, we can try to approximate it in a way that allows us to finish all the steps sometime before the heat death of the universe.

This tutorial explains the exact recipe for finding truth, Solomonoff induction, and also explains some attempts to approximate it in ways that allow us to figure out now what is (probably) true. Fear not; I shall not assume you know anything beyond grade-school math.

Like my Crash Course in the Neuroscience of Human Motivation, this tutorial is long. You may not have time for it; that’s fine. But if you do read it, I recommend you read it in sections.

Contents:

  1. Algorithms

  2. Induction

  3. Occam’s Razor

  4. Updating Beliefs

  5. The Problem of Priors

  6. Binary Sequence Prediction

  7. Solomonoff and Kolmogorov

  8. Solomonoff’s Lightsaber

  9. Approximations

  10. Philosophical Implications

Algorithms

At an early age you learned a set of precisely-defined steps — a ‘recipe’ or, more formally, an algorithm — that you could use to find the largest number in an unsorted list of numbers like this:

21, 18, 4, 19, 55, 12, 30

The algorithm you learned probably looked something like this:

  1. Note the leftmost item as the largest you’ve seen in this list so far. If this is the only item on the list, output it as the largest number on the list. Otherwise, proceed to step 2.

  2. Look at the next item. If it is larger than the largest item noted so far, note it as the largest you’ve seen in this list so far. Proceed to step 3.

  3. If you have not reached the end of the list, return to step 2. Otherwise, output the last noted item as the largest number in the list.

Other algorithms could be used to solve the same problem. For example, you could work your way from right to left instead of from left to right. But the point is that if you follow this algorithm exactly, and you have enough time to complete the task, you can’t fail to solve the problem. You can’t get confused about what one of the steps means or what the next step is. Every instruction tells you exactly what to do next, all the way through to the answer.

You probably learned other algorithms, too, like how to find the greatest common divisor of any two integers (see image on right).

But not just any set of instructions is a precisely-defined algorithm. Sometimes, instructions are unclear or incomplete. Consider the following instructions based on an article about the scientific method:

  1. Make an observation.

  2. Form a hypothesis that explains the observation.

  3. Conduct an experiment that will test the hypothesis.

  4. If the experimental results disconfirm the hypothesis, return to step #2 and form a hypothesis not yet used. If the experimental results confirm the hypothesis, provisionally accept the hypothesis.

This is not an algorithm.

First, many of the terms are not clearly defined. What counts as an observation? What counts as a hypothesis? What would a hypothesis need to be like in order to ‘explain’ the observation? What counts as an experiment that will ‘test’ the hypothesis? What does it mean for experimental results to ‘confirm’ or ‘disconfirm’ a hypothesis?

Second, the instructions may be incomplete. What do we do if we reach step 4 and the experimental results neither ‘confirm’ nor ‘disconfirm’ the hypothesis under consideration, but instead are in some sense ‘neutral’ toward the hypothesis? These instructions don’t tell us what to do in that case.

An algorithm is a well-defined procedure that takes some value or values as input and, after a finite series of steps, generates some value or values as output.

For example, the ‘find the largest number’ algorithm above could take the input {21, 18, 4, 19, 55, 12, 30} and would, after 13 steps, produce the following output: {55}. Or it could take the input {34} and, after 1 step, produce the output: {34}.

What we’re looking for is a precise algorithm that will produce truth as its output.

Induction

Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes.a

The problem of inference is this: We have a collection of observations (data), and we have a collection of hypotheses about the underlying causes of those observations. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events.

Suppose your data concern a large set of stock market price changes and other events in the world. You’d like to know the processes responsible for the stock market price changes, because then you can predict what the stock market will do in the future, and make some money.

Or, suppose you are a parent. You come home from work to find a chair propped against the refrigerator, with the cookie jar atop the fridge a bit emptier than before. One hypothesis that leaps to mind is that your young daughter used the chair to reach the cookies. However, many other hypothesis explain the data. Perhaps a very short thief broke into your home and stole some cookies. Perhaps your daughter put the chair in front of the fridge because the fridge door is broken and no longer stays shut, and you forgot that your friend ate a few cookies when he visited last night. Perhaps you moved the chair and ate the cookies yourself while sleepwalking the night before.

All these hypotheses are possible, but intuitively it seems like some hypotheses are more likely than others. If you’ve seen your daughter access the cookies this way before but have never been burgled, then the ‘daughter hypothesis’ seems more plausible. If some expensive things from your bedroom and living room are missing and there is hateful graffiti on your door at the eye level of a very short person, then then ‘short thief’ hypothesis seems more plausible than before. If you suddenly remember that your friend ate a few cookies and broke the fridge door last night, the ‘broken fridge door’ hypothesis gains credibility. If you’ve never been burgled and your daughter is out of town and you have a habit of moving and eating things while sleepwalking, the ‘sleepwalking’ hypothesis is less bizarre.

But the weight you give to each hypothesis depends greatly on your prior knowledge. What if you had just been hit on the head and lost all past memories, and for some reason the most urgent thing you wanted to do was to solve the mystery of the chair in front of the fridge door? Then how would weigh the likelihood of the available hypotheses?

Occam’s Razor

Consider a different inductive problem. A computer program outputs the following sequence of numbers:

1, 3, 5, 7

Which number comes next? If you guess correctly, you’ll win $500.

In order to predict the next number in the sequence, you make a hypothesis about the process the computer is using to generate these numbers. One obvious hypothesis is that it is simply listing all the odd numbers in ascending order from 1. If that’s true, you should guess 9 to win the $500.

But perhaps the computer is using a different algorithm to generate the numbers. Suppose that n is the step in the sequence, so that n=1 when it generated ‘1’, n=2 when it generated ‘3’, and so on. Maybe the computer used this equation to calculate each number in the sequence:

2n − 1 + (n − 1)(n − 2)(n − 3)(n − 4)

If so, the next number in the sequence will be 33. (Go ahead, check the calculations.) But doesn’t the first hypothesis seem more likely?

The principle behind this intuition, which goes back to William of Occam, could be stated:

Among all hypotheses consistent with the observations, the simplest is the most likely.

The principle is called Occam’s razor because it ‘shaves away’ unnecessary assumptions.

For example, think about the case of the missing cookies again. In most cases, the ‘daughter’ hypothesis seems to make fewer unnecessary assumptions than the ‘short thief’ hypothesis does. You already know you have a daughter that likes cookies and knows how to move chairs to reach cookies. But in order for the short thief hypothesis to be plausible, you have to assume that (1) a thief found a way to break into your home, that (2) the thief wanted cookies more than anything else from your home, that (3) the thief was, unusually, too short to reach the top of the fridge without the help of a chair, and many other unnecessary assumptions.

Occam’s razor sounds right, but can it be made more precise, and can it be justified? We will return to those questions later. For now, let us consider another important part of induction, that of updating our beliefs in response to new evidence.

Updating Beliefs

You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop.

You peek your head out of the trench, trying to get a better look.

Bam! A bullet glance off your helmet and you duck down again.

“Okay,” you think. “I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it might still be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away.”

After another minute, you dare to take another look, and peek your head out of the trench again.

Bam! Another bullet glances off your helmet! You duck down again.

“Damn,” you think. “It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me twice in a row from that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.”

This is an example of updating beliefs in response to evidence, and we do it all the time.

You start with some prior beliefs, and all of them are uncertain. You are 99.99% certain the Earth revolves around the sun, 90% confident your best friend will attend your birthday party, and 40% sure that the song on the radio you’re listening to was played by The Turtles.

Then, you encounter new evidence — new observations — and update your beliefs in response.

Suppose you start out 85% confident the one remaining enemy soldier is not a sniper, leaving only 15% credence to the hypothesis that he is a sniper. But then, a bullet glances off your helmet — an event far more likely if the enemy soldier is a sniper than if he is not. So now you’re only 40% confident he’s not a sniper, and 60% confident he is. Another bullet glances off your helmet, and you update again. Now you’re only 2% confident he’s not a sniper, and 98% confident he is a sniper.

Now, I’m about to show you some math, but don’t run away. The math isn’t supposed to make sense if you haven’t studied it. Don’t worry; I’m going to explain it all.

There’s a theorem in probability theory that tells you how likely one observation is given some other observations. It’s called Bayes’ Theorem.

At this point you may want to take a break and read either tutorial #1, tutorial #2, tutorial #3, or tutorial #4 on Bayes’ Theorem. I’ll only say a little bit more about Bayes’ Theorem in this tutorial.

The short form of Bayes’ Theorem looks like this:

Let’s unpack this. The H refers to some hypothesis, and the E responds to some evidence. p(x) refers to the probability of x. The pipe symbol | means ‘given’ or ‘assuming’. So, Bayes’ Theorem reads:

[The probability of hypothesis H given evidence E] is equal to [the probability of evidence E given hypothesis H] times [the probability of hypothesis H] divided by [the probability of evidence E].

You can see how this tells us what we’d like to know. We want to know the probability of some hypothesis — some model of the world that, if correct, will allow us to successfully predict the future — given the evidence that we have. And that’s what Bayes’ Theorem tells us. Now we just plug the numbers in, and solve for p(H|E).

Of course, it’s not easy to “just plug the numbers in.” You aren’t an all-knowing god. You don’t know exactly how likely it is that the enemy soldier would hit your helmet if he’s a sniper, compared to how likely that is if he’s not a sniper. But you can do your best. With enough evidence, it will become overwhelmingly clear which hypothesis is correct.

At this point, I’ll let the tutorials on Bayes’ Theorem above do the heavy lifting on how to update beliefs. Let me get back to the part of induction those tutorials don’t explain: the choice of priors.

The Problem of Priors

In the example above where you’re a soldier in combat, I gave you your starting probabilities: 85% confidence that the enemy soldier was a sniper, and 15% confidence he was not. But what if you don’t know your “priors”? What then?

When using Bayes’ Theorem to calculate your probabilities, your choice of prior can influence your results greatly. But how can we determine the probability of a hypothesis before seeing any data? What does that even mean?

Legg (2008) explains:

Bayesians respond to this in a number of ways. Firstly, they point out that the problem is generally small, in the sense that with a reasonable prior and quantity of data, the posterior distribution P(h|D) depends almost entirely on D rather than the chosen prior P(h). In fact on any sizable data set, not only does the choice of prior not especially matter, but Bayesian and classical statistical methods typically produce similar results, as one would expect. It is only with relatively small data sets or complex models that the choice of prior becomes an issue.

If classical statistical methods could avoid the problem of prior bias when dealing with small data sets then this would be a significant argument in their favour. However Bayesians argue that all systems of inductive inference that obey some basic consistency principles define, either explicitly or implicitly, a prior distribution over hypotheses. Thus, methods from classical statistics make assumptions that are in effect equivalent to defining a prior. The difference is that in Bayesian statistics these assumptions take the form of an explicit prior distribution. In other words, it is not that prior bias in Bayesian statistics is necessarily any better or worse than in classical statistics, it is simply more transparent.

But this doesn’t solve our problem. It merely points out that other statistical techniques don’t fare any better.

What we’d like is to reduce the potential for abusing one’s opportunity to select priors, and to use agreed-upon priors when possible. Thus, many standard “prior distributions” have been developed. Generally, they distribute probability equally across hypotheses.

But to solve the problem of priors once and for all, we’d like to have an acceptable, universal prior distribution, so that there’s no fuzziness about the process of induction. We need a recipe, and algorithm, for finding out the truth. And there can’t be any fuzziness in an algorithm.

Binary Sequence Prediction

[intro to binary sequence prediction]

Solomonoff and Kolmogorov

[summarize the contributions of Solomonoff and Kolmogorov]

Solomonoff Lightsaber

[explain the result: Solomonoff Induction]

Approximations

[survey a few of the approximations of Solomonoff Induction in use today]

Philosophical Implications

[survey of some philosophical implications of unviersal induction]

Notes

a This quote and some of the examples to follow are from Legg (2008).

References

Legg (2008). Machine Superintelligence. PhD thesis, Department of Informatics, University of Lugano.

Rathmanner & Hutter (2011). A philosophical treatise of universal induction.