Naturalistic trust among AIs: The parable of the thesis advisor’s theorem

Eliezer and Marcello’s article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb’s theorem, are problematic for an agent using logical reasoning. One of these desiderata is naturalistic trust: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, also uses PA for its reasoning. Our AI looks at the theorem prover’s display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it’s on the prover’s display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that there exists a proof of the lemma, by Löb’s theorem this doesn’t imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head different from proofs checked somewhere in the environment. (The above isn’t fully formal, but the formal details can be filled in.)

At the MIRI’s December workshop (which started today), we’ve been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn’t consider to be entirely sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there’s some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is “safe” to take this action (e.g., the action doesn’t blow up the planet). Nik suggests to relax this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; OR (ii) proven in PA that it’s provable in PA that the action is safe; OR (iii) proven in PA that it’s provable in PA that it’s provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover’s display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that it’s provable that A → safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that it’s provable that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it’s provable that A is provable, it can use the same sort of reasoning to prove that it’s provable that it’s provable that safe(X), and again take action X.

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, “trusts” that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or—etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are “trustworthy” in this sense—if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some n such that it’s “provable^n” that A, and that’s not enough for it to use the lemma. If the theorem prover flashed n on its screen together with A, everything would be fine and dandy; but if the AI doesn’t know n, it’s not able to use the theorem prover’s work. So it still seems that the AI is unwilling to “trust” another system that reasons just like the AI itself.

I want to try to shed some light on this obstacle by giving an intuition for why the AI’s behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that’s been bothering you, and she suggests that it’s an easy consequence of a theorem in cohistonomical tomolopy. You haven’t heard of this theorem before, and find it rather surprising, so you ask for the proof.

“Well,” she says, “I’ve heard it from my thesis advisor.”

“Oh,” you say, “fair enough. Um—”


“You’re sure that your advisor checked it carefully, right?”

“Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she’s convinced herself of.”

“Oh, I see! Well, fair enough. I’d still like to understand why this theorem is true, though. You wouldn’t happen to know your advisor’s proof, would you?”

“Ah, as a matter of fact, I do! She’s heard it from her thesis advisor.”


“Something the matter?”

“Er, have you considered...”

Oh! I’m glad you asked! In fact, I’ve been curious myself, and yes, it does happen to be the case that there’s an infinitely descending chain of thesis advisors all of which have established the truth of this theorem solely by having heard it from the previous advisor in the chain.” (This parable takes place in a world without a big bang—human history stretches infinitely far into the past.) “But never to worry—they’ve all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction—my advisor wouldn’t have accepted it from her advisor without checking his reasoning first, and he would have accepted it from his advisor without checking, etc.”

“Uh, doesn’t it bother you that nobody has ever, like, actually proven the theorem?”

“Whatever in the world are you talking about? I’ve proven it myself! In fact, I just told you that infinitely many people have each proved it in slightly different ways—for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used her advisor instead...”

This can’t literally happen with a sound proof system, but the reason is that that a system like PA can only accept things as true if they have been proven in a system weaker than PA—i.e., because we have Löb’s theorem. Our mathematician’s advisor would have to use a weaker system than the mathematician herself, and the advisor’s advisor a weaker system still; this sequence would have to terminate after a finite time (I don’t have a formal proof of this, but I’m fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems), and so someone will actually have to have proved the actual theorem on the object level.

So here’s my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn’t get passed on indefinitely—you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that’s not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there’s something that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you’ve found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I can now tell you one way to do something like this: Nik’s idea, which I was talking about above. Remember how our AI “trusts” the theorem prover that flashes the number n which says how many times you have to iterate “that it’s provable in PA that”, but doesn’t “trust” the prover that’s exactly the same except it doesn’t tell you this number? That’s the thing that decreases. If the theorem prover actually establishes A by observing a different theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.

The cool thing about Nik’s idea is that it doesn’t change mathematical strength while going down the chain. In fact, it’s not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves PA proves A, then it follows that PA proves A.

I can guess what Eliezer’s reaction to my argument here might be: The problem I’ve been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn’t equal our expectation of the limit. #infiniteworldproblems

-- Eliezer, not coincidentally after talking to me

I’m not going to be able to resolve that argument in this post, but briefly: I agree that we probably live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I’m talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to “work” properly in these infinite worlds, or we would be able to turn our story of the thesis advisors’ proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema “if PA* proves phi, then phi” for every formula phi; this is easily seen to be inconsistent by Löb’s theorem, but if we didn’t know that yet, we could translate the story of the thesis advisors (which are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail—as long as we make sure that these infinite worlds, and their properties that we’re using in our arguments, really can be formalized in standard mathematics, of course.