# An Introduction to Löb’s Theorem in MIRI Research

Would you like to see a primer on several MIRI research topics (assuming only the background of having taken a course with proofs in math or computer science)? Or are you curious why MIRI does so much with mathematical logic, and why people on Less Wrong keep referring to Löb’s Theorem?

If you answered yes to either question, you may be interested in my lecture notes, An Introduction to Löb’s Theorem in MIRI Research! These came out of an introductory talk that I gave at a MIRIx workshop.

Since I’ve got some space here, I’ll just copy and paste the table of contents and the introduction section...

## Contents

**1 Introduction **

**2 Crash Course in Löb’s Theorem**

2.1 Gödelian self-reference and quining programs

2.2 Löb’s Theorem

**3 Direct Uses of Löb’s Theorem in MIRI Research **

3.1 “The Löbstacle”

3.2 Löbian cooperation

3.3 Spurious counterfactuals

**4 Crash Course in Model Theory **

4.1 Axioms and theories

4.2 Alternative and nonstandard models

**5 Uses of Model Theory in MIRI Research **

5.1 Reflection in probabilistic logic

**6 Crash Course in Gödel-Löb Modal Logic **

6.1 The modal logic of provability

6.2 Fixed points of modal statements

**7 Uses of Gödel-Löb Modal Logic in MIRI Research **

7.1 Modal Combat in the Prisoner’s Dilemma

7.2 Modal Decision Theory

**8 Acknowledgments**

## 1 Introduction

This expository note is devoted to answering the following question: why do many MIRI research papers cite a 1955 theorem of Martin Löb, and indeed, why does MIRI focus so heavily on mathematical logic? The short answer is that this theorem illustrates the basic kind of self-reference involved when an algorithm considers its own output as part of the universe, and it is thus germane to many kinds of research involving self-modifying agents, especially when formal verification is involved or when we want to cleanly prove things in model problems. For a longer answer, well, welcome!

I’ll assume you have some background doing mathematical proofs and writing computer programs, but I won’t assume any background in mathematical logic beyond knowing the usual logical operators, nor that you’ve even heard of Löb’s Theorem before.

To motivate the mathematical sections that follow, let’s consider a toy problem. Say that we’ve designed Deep Thought 1.0, an AI that reasons about its possible actions and only takes actions that it can show to have good consequences on balance. One such action is designing a successor, Deep Thought 2.0, which has improved deductive abilities. But if Deep Thought 1.0 (hereafter called DT1) is to actually build Deep Thought 2.0 (DT2), DT1 must first conclude that building DT2 will have good consequences on balance.

There’s an immediate difficulty—the consequences of building DT2 include the actions that DT2 takes; but since DT2 has increased deductive powers, DT1 can’t actually figure out what actions DT2 is going to take. Naively, it seems as if it should be enough for DT1 to know that DT2 has the same goals as DT1, that DT2’s deductions are reliable, and that DT2 only takes actions that it deduces to have good consequences on balance.

Unfortunately, the straightforward way of setting up such a model fails catastrophically on the innocent-sounding step “DT1 knows that DT2’s deductions are reliable”. If we try and model DT1 and DT2 as proving statements in two formal systems (one stronger than the other), then the only way that DT1 can make such a statement about DT2’s reliability is if DT1 (and thus both) are in fact unreliable! This counterintuitive roadblock is best explained by reference to Löb’s theorem, and so we turn to the background of that theorem.

I am stuck at part 2.2: “So in particular, if we could prove that mathematics would never prove a contradiction, then in fact mathematics would prove that contradiction”

I’ve spend 15 minutes on this, but still cannot see relation to löb’s theorem. Even though it seems like it should be obvious to attentive reader.

Could anyone please explain or link me to an explanation?

Suppose “mathemathics would never prove a contradiction”. We can write it out as

¬Provable(⊥). This is logically equivalent toProvable(⊥) → ⊥, and it also impliesProvable(Provable(⊥) → ⊥)by the rules of provability. But Löb’s theorem expresses∀ A (Provable(Provable(A) → A) → Provable(A)), which we can instantiate toProvable(Provable(⊥) → ⊥)→ Provable(⊥), and now we can apply modus ponens and our assumption to get aProvable(⊥).Wasn’t Löb’s theorem ∀ A (Provable(Provable(A) → A) → Provable(A))? So you get Provable(⊥) directly, rather than passing through ⊥ first. This is good, as, of course, ⊥ is always false, even if it is provable.

You’re right, I mixed it up. Edited the comment.

Thanks for asking! I should add another sentence in that paragraph; the key step, as Kutta and Quill noted, is that “not A” is logically equivalent to “if A, then False”, and in particular the statement “

2+2=5is not provable” is logically equivalent to “if2+2=5were provable, then 2+2=5″, and then we can then run Löb’s Theorem.Unless I have completely got something wrong, it should go like this:

I can’t prove a contradiction. I.e. I can’t prove false statements.

For all X: If I find a proof of X, X is true. I.e. For all X: L(X)

or all X: X (Löb’s Theorem).

Perhaps this is the wrong venue, but I’m curious how this work generalizes and either applies or doesn’t apply to other lines of research.

Schmidhuber’s group has several papers on “Goedel machines” and it seems like they involve the use of proofs to find self-rewrites.

Their 2005 paper “Completely Self-Referential Optimal Reinforcement Learners” explained the design and their 2012 paper “Towards an Actual Gödel Machine Implementation” seems to be working towards making something vaguely practical. This is the same group whose PhD students helped founded of DeepMind (like Shane Legg as founder and several others as very early employees). Deepmind was then acquired by Google in 2014.

Since that architecture uses a theorem proving system, and creates new versions of itself, and can even replace its own theorem proving system, it naively seems like the Löbstacle might come up. Are you familiar with Schmidhuber’s group’s work? Does it seem like their work will run into the Löbstacle and they’re just not talking about it? Or does it seem like their architecture makes worries about the Löbstacle irrelevant via some clever architecting?

Basically, my question is “The Löbstacle and Gödel Machines… what’s up with them?” :-)

This is several months too late, but yes! Gödel Machines runs into the Löbstacle, as seen in this MIRI paper. From the paper:

Actually, why is it that when the Lobian obstacle is discussed that it seem to always be in reference to an AI trying to determine if a successor AI is safe, and not an AI trying to determine whether or not it, itself, is safe?

Because we’re talking about criteria for action, not epistemology. The heart of the Lobstacle problem is that straightforward ways of evaluating the consequences of actions start to break down when those consequences involve the outcomes of deductive processes equal to or greater than the one brought to bear.

Question: If we do manage to build a strong AI, why not just let it figure this problem out on its own when trying to construct a successor? Almost definitionally, it will do a better job of it than we will.

The biggest problem with deferring the Lobstacle to the AI is that you could have a roughly human-comparable AI which solves the Lobstacle in a hacky way, which changes the value system for the successor AI, which is then intelligent enough to solve the Lobstacle perfectly and preserve that new value system. So now you’ve got a superintelligent AI locked in on the wrong target.

If you want to take that as a definition, then we can’t build a strong AI without solving the Lobstacle!

Yes, obviously. We solve the Lobstacle by not ourselves running on formal systems and sometimes accepting axioms that we were not born with (things like PA). Allowing the AI to only do things that it can prove will have good consequences using a specific formal system would make it dumber than us.

I think, rather, that humans solve decision problems that involve predicting other human deductive processes by means of some evolved heuristics for social reasoning that we don’t yet fully understand on a formal level. “Not running on formal systems” isn’t a helpful answer for how to make good decisions.

I think that the way that humans predict other humans is the wrong way to look at this, and instead consider how humans would reason about the behavior of an AI that they build. I’m not proposing simply “don’t use formal systems”, or even “don’t limit yourself exclusively to a single formal system”. I am actually alluding to a far more specific procedure:

Come up with a small set of basic assumptions (axioms)

Convince yourself that these assumptions accurately describe the system at hand

Try to prove that the axioms would imply the desired behavior

If you cannot do this return for the first step and see if additional assumptions are necessary

Now it turns out that for almost any mathematical problem that we are actually interested in, ZFC is going to be a sufficient set of assumptions, so the first few steps here are somewhat invisible, but they are still there. Somebody need need to come up with these axioms for the first time, and each individual who wants to use them should convince themselves that they are reasonable before relying on them.

A good AI should already do this to some degree. It needs to come up with models of a system that it is interacting with before determining its course of action. It is obvious that it might need to update what assumptions it’s using the model physical laws, why shouldn’t it just do the same thing for logical ones?

Similar to V_V’s comment. I’m only a layman, but I don’t understand one of the first steps.

The paper appears to jump straight into the logic, justifying itself (pardon the pun) by a statement about the intricacies of formal systems proving things about other formal systems.

Can you explain how and why all AI systems correspond to formal systems—what precisely you mean by ‘formal system’?

(Btw, I have in fact read GEB, but I am still unable to confidently define a formal system.)

Or a better question might be “Why does this need to be possible within the realms of formal logic to be doable at the level of code/AI? There are many general problems intractable as part of pure maths, yet many numerical methods exist to solve particular instantiations. Why must this be doable at the level of pure logic?”

Good question as well! I should add motivation to the notes on this point.

Basically, we study deterministic mathematical objects within formal systems because these cases are easier for us to analyze and prove theorems about; it’s an example of looking first under the streetlight before meandering farther into the darkness. Hopefully this way we’ll notice many phenomena that still hold when we add randomness, bounded computations, and other realistic limitations.

And indeed, plenty of the topics in MIRI research do hold up when you make them messier. Many of the topics in formal systems can handle random variables just as easily as deterministic facts. There’s a version of Löb’s Theorem for bounded-length proof searches, and tracing it through shows that Löbian cooperation and other phenomena behave the same there as in the unbounded case (for sufficiently large choices of the bounds). Using quantum random variables to evaluate logical counterfactuals doesn’t solve implementing UDT, it just collapses it into CDT; so we still need to study logical counterfactuals. And so on.

Also, when I say “formal system” you can usually substitute in “Peano Arithmetic” or “Peano Arithmetic plus some extra axioms”.

Right! So you’re trying to get ahold of the idea of an intelligent computational agent, in clear formalisms, and trying to solve the basic issues that arise there. And often, the issues you discover at the fundamental mathematical level work their way through to the highly applied level.

That makes sense. I feel like this is the most direct answer to the question

Thanks! I’ll try and work that into the introduction.

Why do we think that agents work deductively? Nothing in the AI or cog-sci literature indicates this, and as an “easy case” for simplified treatment, it seems in fact to expand out to the point of seeming impossibility. Are we attacking deductive-only agents

becausethey appear to be sick with a Hard Problem of Vingeian Reasoning?I haven’t read the paper yet (thanks for posting it, anyway), so maybe the answer to my question is there, but there is something about MIRI interest with Löb’s theorem that always bugged me, specifically:

Sure, DT1 can’t prove that DT2 decisions are reliable, and in fact in general it can’t even prove that DT1 itself makes reliable decisions, but DT1 may be able to prove “Assuming that DT1 decisions are reliable, then DT2 decisions are reliable”.

Isn’t that enough for all practical purposes?

Notice that this even makes sense in the limit case where DT2 = DT1, which isn’t necessarily just a theoretical pathological case but can happen in practice when even a non-self-modifying DT1 ponders “Why should I not kill myself?”

Am I missing something?

Isn’t Löb’s theorem just essentially a formal way of showing that you can’t prove that you are not insane?

Good question! Translating your question to the setting of the logical model, you’re suggesting that instead of using provability in Peano Arithmetic as the criterion for justified action, or provability in PA + Con(PA) (which would have the same difficulty), the agent uses provability under the assumption that its current formal system (which includes PA) is consistent.

Unfortunately, this turns out to be an inconsistent formal system!

Thus, you definitely do not want an agent that makes decisions on the criterion “if I assume that my own deductions are reliable, then can I show that this is the best action?”, at least not until you’ve come up with a heuristic version of this that doesn’t lead to awful self-fulfilling prophecies.

I don’t think he was talking about self-PA, but rather an altered decision criteria, such that rather that “if I can prove this is good, do it” it is “if I can prove that if I am consistent then this is good, do it” which I think doesn’t have this particular problem, though it does have others, and it still can’t /increase/ in proof strength.

Yes.

Mmm, I think I can see it.

What about “if I can prove that if a version of me with unbounded computational resources is consistent then this is good, do it”. (*) It seems to me that this allows increase in proof strength up to the proof strength of that particular ideal reference agent.

(* there should be probably additional constraints that specify that the current agent, and the successor if present, must be provably approximations of the unbounded agent in some conservative way)

“if I can prove that if a version of me with unbounded computational resources is consistent then this is good, do it”

In this formalism we generally assume infinite resources anyway. And even if this is not the case, consistent/inconsistent doesn’t depend on resources, only on the axioms and rules for deduction. So this still doesn’t let you increase in proof strength, although again it should help avoid losing it.

If we are already assuming infinite resources, then do we really need anything stronger than PA?

A formal system may be inconsistent, but a resource-bounded theorem prover working on it might never be able to prove any contradiction for a given resource bound. If you increase the resource bound, contradictions may become provable.