Walkthrough of the Tiling Agents for Self-Modifying AI paper

This is my walk­through of Tiling Agents for Self-Mod­ify­ing AI, and the Löbian Ob­sta­cle. It’s meant to sum­ma­rize the pa­per and provide a slightly-less-tech­ni­cal in­tro­duc­tion to the con­tent. I’ve also col­lected a num­ber of ty­pos and sug­ges­tions, which can be found at the end of the post.


We want to be able to con­sider agents which build slightly bet­ter ver­sions of them­selves, which build slightly bet­ter ver­sions of them­selves, and so on. This is referred to as an agent “tiling” it­self. This in­tro­duces a ques­tion: how can the par­ent agent trust its de­scen­dants?

We’ll be deal­ing with log­i­cal sys­tems, which are crisply de­ter­minis­tic. In such a set­ting, child agents can be “ab­solutely trust­wor­thy”: for ex­am­ple, if the child uses the same meth­ods of proof and the same ax­ioms as the par­ent, then the par­ent should be able to trust the child un­con­di­tion­ally: or, at least, the par­ent should trust the child at least as much as it trusts it­self.

This turns out to be a huge prob­lem, be­cause mod­ern log­i­cal sys­tems don’t trust them­selves (or any log­i­cal sys­tems of equal strength), to say noth­ing of (sup­pos­edly more pow­er­ful) suc­ces­sor sys­tems.

This prob­lem of trust is rooted in a the­o­rem stated by the math­e­mat­i­cian Löb.

As­sum­ing that we can sur­mount such difficul­ties, there are a num­ber of other prin­ci­ples that we want in our tiling agents, such as the Vingean prin­ci­ple (the par­ent shouldn’t have to know ex­actly what the chil­dren will do) and the Nat­u­ral­is­tic prin­ci­ple (agents should think of them­selves as part of the world). Th­ese will be dis­cussed in de­tail later in the pa­per.


  • I’ll use @x to mean “the ex­e­cu­tion of x”. (The pa­per uses x̄ in­stead.)

  • I’ll use [x] de­notes the Gödel num­ber of x in some suit­able en­cod­ing. (The pa­per uses ⌈x⌉ in­stead).

  • □T[x] is read “There is a proof of x in T”. It’s the same as “∃y. Prooves_T(y, [x])”. The T is omit­ted when it’s clear from the con­text.

  • A0, A1, … will de­note agents. Note that A0 is the child and A1 is the par­ent. I per­son­ally would have done things the other way around, but I’ll stick to the pa­per’s de­scend­ing style.

  • G will be used to de­note a world-state in which the agent’s goals have been achieved.

  • ∅ will be used to de­note a world-state wherein noth­ing of im­port has been changed by the agent. For more speci­fic­ity, see the pa­per.

  • A ⊩ x is read “A has cog­ni­tively con­cluded x” or “A be­lieves x”.

  • PA is Peano Arith­metic, ZF is Zer­melo-Fraenkel set the­ory.

  • T will de­note a the­ory. In gen­eral, as­sume that T is at least as pow­er­ful as PA.

  • T+1 is T ∪ {¬□T[⊥]} (T plus the fact that T is con­sis­tent). T+2 is (T+1)+1, etc.

Here’s a few re­minders, just in case you’re rusty:

  • φ[x\y] de­notes “φ, with all in­stances of x re­placed by y”. The slash dis­am­biguates this from quot­ing.

  • means “is defined to be”.

  • T ⊢ φ is pro­nounced “T de­rives φ”.

So, for ex­am­ple, T ⊢ □T[φ] is read “T knows that there is a T-proof of φ.”

In gen­eral, I’ll be less for­mal than the pa­per. If I’ve left any am­bi­gui­ties, see the pa­per for clar­ifi­ca­tion.

Trust­ing Yourself

Imag­ine that we’ve got a par­ent agent A1. It wants to achieve the goal G. Among the things it can do is cre­ate more agents. What we want to do is find an agent A0 within the ac­tions of A1 such that:

A1 ⊩ @A0 → (∀ac­tion ∈ Acts0. @ac­tion → □0[@ac­tion → (∅ ∨ G)])

Trans­lated, this reads “A1 be­lieves that, if it con­structs A0, then for all ac­tions available to A0, that ac­tion will only be taken if A0 has proven that tak­ing the ac­tion leads to ei­ther noth­ing of im­port hap­pen­ing, or the goal be­ing achieved.”

(We also re­quire A1 ⊩ @A0 → ((∀ac­tion ∈ Acts0. ¬@ac­tion) → ∅), which says “if A0 takes no ac­tions then noth­ing will hap­pen”.)

We would fur­ther like A1 to be able to trust A0. Ob­vi­ously, we don’t want A1 to trust its chil­dren un­con­di­tion­ally. But as­sum­ing that A0′s proofs use the same mechanism as A1′s proofs, and as­sum­ing that A1 ver­ifies that A0′s proof sys­tem works ex­actly like A1′s own proof sys­tem, we would like to have the fol­low­ing:

A1 ⊩ @A0 → (∀ac­tion ∈ Acts0. @ac­tion → (∅ ∨ G))

ie, A1 be­lieves that, if it con­structs A0, then any ac­tion taken by A0 ac­tu­ally will ei­ther cause noth­ing to hap­pen or achieve the goal.

This is an im­por­tant dis­tinc­tion, make sure you un­der­stand it be­fore pro­ceed­ing. We want A1 to un­wrap A0′s proofs, to look at things in­side the □0 and con­clude that they ac­tu­ally are true. We want A1 to look at □0[φ] and ac­tu­ally con­clude φ.

Un­for­tu­nately, this is im­pos­si­ble — at least, it’s im­pos­si­ble when A1 has con­cluded that A0′s proof sys­tem is iden­ti­cal to A1s proof sys­tem. Why? Be­cause A1 can’t even trust its own for­mal­ism. This is the famed Löb’s the­o­rem.

The Löbstacle

In­tu­itively, Löb’s the­o­rem claims “No sys­tem T can trust its own proofs un­til it wit­nesses them.” More for­mally, it states that in a the­ory T at least as pow­er­ful as PA, □T[φ] → φ ⇔ φ: T can only de­duce φ from a T-proof of φ if φ is ac­tu­ally true.

You might ex­pect that there are some sen­tences ψ which T can­not prove, but for which T knows “if I could prove ψ, then ψ would be true”. This is not the case! There are no sen­tences that T can­not prove, for which T will trust “but if I could prove it, then it would be true”. T does not trust its own proofs un­til it wit­nesses them. Here’s why:

Any T of suffi­cient power can define a sen­tence L ↔ □[L]→φ (L is true iff a proof of L im­plies φ.) T can de­duce the fol­low­ing about such an L for any φ.

  1. L → (□[L]→φ) (from the defi­ni­tion of L)

  2. □[L→(□[L]→φ)] (from 1 be­cause T can prove the things it knows: T ⊢ x ⇒ T ⊢ □[x])

  3. □[L]→□[□[L]→φ] (from 2 be­cause prov­abil­ity dis­tributes over im­pli­ca­tion: T ⊢ □[x→y] ⇒ T ⊢ □[x]→□[y])

  4. □[L]→(□[□[L]]→□[φ]) (from 3 be­cause prov­abil­ity dis­tributes over im­pli­ca­tion)

  5. □[L]→□[□[L]] (be­cause T can prove things it knows)

  6. □[L]→□[φ] (from 4 and 5)

Note that φ here is gen­eral. This means that for any φ, there is a sen­tence L such that φ is prov­able when L is prov­able, and also φ is true when L is prov­able. Now, if T had □[φ]→φ then:

  • □[φ]→φ (as­sump­tion)

  • □[L]→φ (from 6 and as­sump­tion)

  • L (L ↔ □[L]→φ by defi­ni­tion, and we just de­duced □[L]→φ)

  • □[L] (T can prove things it knows)

  • φ

We have now proved φ in gen­eral: such a T can prove any­thing. No­tice what this means: any sys­tem of suffi­cient strength can de­rive a proof of any φ from the mere fact that if prov­able then it is true.

(Fun fact: pick φ=⊥ to get a proof of the sec­ond in­com­plete­ness the­o­rem. ¬□[⊥] means □[⊥]→⊥, so if you can prove ¬□[⊥] (that you don’t prove a con­tra­dic­tion) then you prove ⊥ (a con­tra­dic­tion).)

Con­clu­sion: A sys­tem can­not trust, in gen­eral, that □[φ] im­plies φ. It can’t know that its proofs are valid.

This is go­ing to make our lives difficult: our tiling agents will be gen­er­at­ing situ­a­tions where the par­ent agent ver­ifies “my child will only ex­e­cute ac­tions that they’ve proven (in T) lead to good out­comes”. By Löb’s the­o­rem, such agents, run­ning on T, can­not con­clude “and so my child, also run­ning on T, will only ex­e­cute ac­tions that lead to good out­comes” — be­cause it can­not trust its own proof sys­tem.

Finite Des­cent Problems

A sys­tem can prove weaker sys­tems con­sis­tent. For ex­am­ple, T+1 can prove T con­sis­tent, and T+2 can prove T+1 con­sis­tent, etc. You might think that we can work around the prob­lem by hav­ing our ini­tial sys­tem run on T+n for large n. But this fails.

First of all, if you pick a large finite n then your sys­tem has a finite “telomere”, a trust-ca­pac­ity that is run­ning out. And even if you try to start with T+ω, it has to be­queath unto its child sys­tem a T+n where n is an or­di­nal less than ω. Though such an n may be ar­bi­trar­ily large, it must be finite, and then we have the finite fuse again. In gen­eral, no mat­ter how in­finitely large an or­di­nal you start with, it will still suffer from a finite de­scent prob­lem (be­cause the or­di­nals are well-or­dered).

And sec­ondly, T+1 is far more pow­er­ful than T. Take, for ex­am­ple, the hy­dra game. It turns out (spoiler alert) you can’t lose: any strat­egy beats ev­ery hy­dra.

Now con­sider PA. Given an h de­not­ing the height of the hy­dra, PA can prove that any strat­egy beats hy­dras of height h. But PA can­not con­clude that any strat­egy beats ev­ery hy­dra. Let­ting ψ(h) de­note “hy­dras of height h can be kil­led by any strat­egy”, it is the case that for all h, PA ⊢ ψ(h). But PA does NOT de­rive ∀h. ψ(h)! PA can­not rule out the pos­si­bil­ity that there is some height at which its proofs will fail: it only trusts proofs that it wit­nesses.

PA+1, on the other hand, has PA+1 ⊢ ∀h. ψ(h) be­cause it knows that PA is con­sis­tent, and thus that there is no height at which PA’s proofs start failing. Thus, in a very real sense, there are prob­lems (like the hy­dra prob­lem, or like check­ing whether a Gold­stein se­quence ter­mi­nates) that can be solved in PA+1 but not PA.

In fact, an agent’s strength is di­rectly re­lated to which or­di­nals it can well-or­der: the abil­ity to well-or­der high or­di­nals di­rectly trans­lates into math­e­mat­i­cal power. Child agents run­ning on weaker math­e­mat­i­cal sys­tems cor­re­spond to a very real loss in ca­pa­bil­ities.

Hav­ing each agent build a weaker child is un­ten­able, and by Löb’s the­o­rem we can’t have the child agent run on the same sys­tem as the par­ent (be­cause the par­ent can’t trust its own sys­tem). Yet we de­mand away to tile agents with­out hav­ing their trust drop be­low a cer­tain baseline. Is this even pos­si­ble? Ex­pe­rience with Gödelian prob­lems may lead you to think that the an­swer is “No”, but as we will see, we can gain a few tech­ni­cal vic­to­ries.

Failed Solutions

Here are a few ideas to get things rol­ling:

Re­cur­sive Diving

Define Safe(x) to be (@x → ∅ ∨ G): in other words, Safe(x) means “ex­e­cut­ing x leads to ei­ther no change, or the goal be­ing achieved.”

In­stead of re­quiring the agent to know □[Safe(x)] → Safe(x), we in­stead define a proxy for safety. This proxy is ψ(x) ↔ Safe(x) ∨ □[ψ(x)]. Then the con­di­tion of ac­tion is ψ(x).

The idea is that in­stead of re­quiring the sys­tem to ac­cept that a proof of safety makes things safe, we in­stead ask it to ei­ther know that an ac­tion is safe, or be able to prove ψ about the ac­tion (which means ei­ther we can prove the ac­tion safe or we can prove we can prove ψ about the ac­tion (which means ei­ther we can prove we can prove the ac­tion safe, or …)). Naïvely, you might ex­pect that the only way out of this re­cur­sive dive is for the sys­tem to ac­cept Safe(x) at some depth of proof. But that’s not what hap­pens.

In fact, ψ(x) is ac­cepted im­me­di­ately, re­gard­less of the safety of the ac­tion. ψ is in fact a Henkin sen­tence: It is of the form ψ(x) ↔ □[ψ(x)] (with a dan­gling ∨ Safe(x) that we can ig­nore). Henkin sen­tences are uni­ver­sally true. Our sys­tem can ig­nore the safety con­straint, skip the re­cur­sive de­scent, and con­clude any ac­tion is OK. This makes it… rather un­suited to the real world.

Buck Tossing

In­stead of re­quiring □T[φ] → φ, we con­struct a tower of the­o­ries. Each the­ory trusts the next the­ory, but not it­self. So T0 has □T1[φ] → φ, and T1 has □T2[φ] → φ. This al­lows par­ents to trust their chil­dren with­out ever trust­ing them­selves (and thus be­com­ing in­con­sis­tent.)

Prob­lem is, if T0 is as strong as PA then it can de­duce that proofs in T0 are valid in T1: it just bumps all the ap­pro­pri­ate ‘n’ up by 1. Then T0 de­duces that □T0[φ] → □T1[φ], from which it de­duces φ. Triumphant, T0 proudly con­cludes □T0[φ]→φ, and im­me­di­ately bursts into in­con­sis­tent flames.

Proof Stacking

It doesn’t mat­ter whether T trusts it­self. WE trust T. If Safe(x), or T proves Safe(x), or T proves that T proves Safe(x), or if there is any depth of proofi­ness such that Safe(x), then the agent may do x.

Let □^k[x] stand for “T proves that T proves that … (k times) … that Safe(x)”. (As­sume k nonzero through­out.) Be­cause this stack­ing has a reg­u­lar struc­ture, there should be a com­pact for­mula ex­press­ing this. So what if we make T’s crite­rion of ac­tion be (Safe(x) ∨ □^k[x])?

Then, when a child con­cludes (Safe(x) ∨ □^k[x]), the par­ent con­cludes that □[Safe(x) ∨ □^k[Safe(x)]].

Prob­lem is, □[Safe(x) ∨ □^k[Safe(x)]] doesn’t re­duce to (Safe(x) ∨ □^j[Safe(x)]), no mat­ter how much we want it to. This is be­cause □^k[φ] is ac­tu­ally a for­mula like (∃ k. Stack­OProofs(k, φ)), and □[∃ k. Stack­OProofs(k, x)] does not re­duce to (∃ j. Stack­OProofs(j, x)), pre­cisely be­cause T does not trust that □[φ]→φ. So this tech­nique doesn’t re­ally dodge the Löb­sta­cle at all.

Is it com­pletely hope­less? No! Just mostly hope­less. We can, it turns out, achieve a few tech­ni­cal vic­to­ries.

Tech­ni­cally Solutions

In­finite Descent

Idea 2 above (buck-toss­ing) is reparable. The idea, you re­mem­ber, is that T0 trusts T1, which trusts T2, etc. The prob­lem was that T0 could de­duce that, by bump­ing all the n in a proof, things true in T0 were true in T1. It could then “lift” its proofs into T1, prove it­self con­sis­tent, and be­come in­con­sis­tent.

This can be fixed, if we some­how find a way to al­low Tn to trust Tn+1 with­out be­ing able to trust that it can lift its own proofs up­wards. For ex­am­ple, in­stead of hav­ing as an ax­iom of Tn that Tn+1[φ]→φ, we could guard the ax­iom by a for­mula ψ. Thus, the new ax­iom be­comes ψ(n) → (Tn+1[φ] → φ).

Now, if only we could find a ψ such that ψ is true for all n, and T can prove ψ(n) for each in­di­vi­d­ual n, but such that T can­not prove ∀n. ψ(n)…

And we can! We were just talk­ing about such for­mu­las ear­lier. We need a for­mula that T-0 can prove for ev­ery n, but that it can’t prove it can prove for ev­ery in. (In other words, for all n we need T ⊢ ψ(n), but we also need that T can NOT de­rive ∀n. ψ(n).)

In or­der to do this, we’re go­ing to need a the­ory more pow­er­ful than T: a the­ory Z such that T can­not prove Z con­sis­tent. (Why? We’ll get to that later.) For con­crete­ness, if T is PA then ZF is more pow­er­ful in the ap­pro­pri­ate man­ner.

Now, here’s an ex­am­ple guardian for­mula: ψ(n)≡¬ProofZ(n, [⊥]) or, in en­glish, ψ(n)=”n does not en­code a proof of con­tra­dic­tion in Z”.

For ev­ery n, T0 can prove ψ(n). But T0 can­not prove that ∀n. ψ(n). Thus, it can no longer “lift” its proofs into the higher sys­tem: “in­creas­ing all the n’s” is no longer a safe op­er­a­tion. T0 does not ex­pect ψ to hold for­ever.

We can prove this sys­tem con­sis­tent. I’ll sketch the proof; see the pa­per for de­tails. We’ll con­struct a sound sys­tem T+ that trusts T and rea­sons about T0. We’ll show that T+ does not prove T0 in­con­sis­tent, and that if T0 were con­sis­tent then T+ would know.

T+ is built from T, a new sym­bol “TRUE”, and ax­ioms stat­ing both that T is con­sis­tent and TRUE is a truth pred­i­cate for T. We note that T+ is strong enough to rea­son about T0 and proofs in T0. T+ rea­sons as fol­lows:

Sup­pose T0 is in­con­sis­tent. Then □T0[⊥]. Con­sider the proof. It has a bad line. The bad line can’t come from the ax­ioms of TRUE, nor from the use of T, as those are trusted. There­fore, it must come from the spe­cial ax­iom, and be of the form ψ(n) → (□Tn[φ(x)] → φ(x)). In or­der for this to be the point of failure, ψ(n) must be true, □Tn[φ] must be true, and φ must be false. Con­sider the Tn proof of φ. Re­curse: its first bad line must also come from a ψ(n) ax­iom, and its first bad line must come from a ψ(n) ax­iom, and so on and so forth. Thus, ψ(n) must be true for all n. So, □T0[⊥]→(∀n. ψ(n)).

This rea­son­ing shows that T+ knows that if T0 is in­con­sis­tent then ψ is true for all n. Fur­ther­more, no­tice that if ψ is true for all n then Z is con­sis­tent.

If there were a proof of con­tra­dic­tion in T0, the proof could be fed to T+, and then T+ would be able to prove the con­sis­tency of Z.

We’ve es­tab­lished that T (and thus T+) can­not prove the con­sis­tency of Z. There­fore, T0 is con­sis­tent.

This is nice, but it’s not with­out prob­lems. For ex­am­ple, T0 ⊢ (∀n. ψ(n)) → ⊥: T0 knows that if ψ is true for all n, then it can bump its proofs up­wards. It fol­lows that T0 ⊢ ∃n. ¬ψ(n): T0 be­lieves that some n en­codes a proof of con­tra­dic­tion in Z. Con­cretely, when T=PA and Z=ZF, T thinks that ZF is in­con­sis­tent. This is… not the type of be­hav­ior that we strive for in real world agents, to say the least.

Also, this means that T0 has only non­stan­dard mod­els: T0 is con­sis­tent, and thus has a model, but as we’ve seen, T0 only has mod­els where T is con­sis­tent but ZF is in­con­sis­tent. This, too, is a hin­der­ance.

Re­gard­less, the pa­per has shown that you can tile agents with­out let­ting their trust fall be­low a cer­tain min­i­mum. (All of T0, T1, T2, … trust T.) This sys­tem doesn’t trust it­self, and thinks ZF is in­con­sis­tent, but we’ve still shat­tered the Löb­sta­cle. Via tech­ni­cal tom­fool­ery, we have a math­e­mat­i­cal sys­tem that can trust a an­other sys­tem of the same power.

This is a cause for hope.

Ques­tion: What if we use ψ(n)≡¬ProofT(n,¬□T[⊥]) (ie ψ(n) means “n does not en­code a proof of T’s con­sis­tency”). The ex­ist­ing proof of con­sis­tency fails be­cause T+ proves T con­sis­tent, but can we prove T0′s con­sis­tency by other means? The fact that T0 ⊢ ∃n. ¬ψ(n) in this case means that T0 be­lieves T is con­sis­tent. This is far nicer than T be­liev­ing Z to be in­con­sis­tent. I sus­pect that such a T0 can rea­son some­thing like “If T is con­sis­tent then so am I”, thus doom­ing it­self, but I haven’t had time to play with this yet.

Up­date: This does fail. The ac­tual proof path is T □(∀n. Tn⊥)→(∀n. Tn⊥), so by Löb’s the­o­rem T proves all such Tn in­con­sis­tent. In other words, if T could prove all Tn con­sis­tent then they would all know it, and would be in­con­sis­tent: and so, by Löb’s the­o­rem, they are.

Para­met­ric Polymorphism

The trick that made the above ap­proach work was that we gave the the­ory a trust con­di­tion that (from the per­spec­tive of the the­ory) “might run out”. Only from out­side the the­ory do we know that the trust will not run out.

The sec­ond ap­proach — para­met­ric poly­mor­phism — makes that tech­nique some­what more ex­plicit.

In para­met­ric poly­mor­phism, we re­quire that the the­ory T un­der con­sid­er­a­tion can be mod­eled by the nat­u­ral num­bers. (This gives us some nice prop­er­ties that we’ll need to use for the con­sis­tency proof.)

We start by adding a new term ‘k’ to our log­i­cal lan­guage. This k, the “num­ber of in­ef­fable mys­tery”, will be rep­re­sented by a num­ber (when N is the model), but T has no knowl­edge about k.

Re­mem­ber that the goal is to have some­thing like □[φ] → φ (though we can never achieve this). In­stead, we add an ax­iom that (k > 0) → □Tk[φ] → φ[k\k-1]. In other words, so long as k is pos­i­tive a proof in Tk of φ im­plies that φ is true af­ter re­plac­ing all k by k-1.

T, in gen­eral, can­not prove that Tk[φ]→φ even for φ that do not men­tion k, be­cuase Tk doesn’t know whether k is pos­i­tive. How­ever, it can prove that when Tk is nonzero it can be­lieve φ “at the next k-step”. Proofs in Tk will need nu­mer­ous guards con­cern­ing the pos­i­tivity of Tk, but in the pos­i­tive branches it will be able to trust Tk proofs: and out­side the sys­tem, we know that the pos­i­tive branches will always be the ones taken.

Be­fore we get ex­cited, we have to check whether Tk is con­sis­tent. As a mat­ter of fact, it is. We can show this by pre­sent­ing a model of Tk: this we can do triv­ially. {N, k=0} (the nat­u­rals, with k mod­eled as 0) is triv­ially a model of Tk: N is a model of T, and when k=0 the new ax­iom schema is triv­ially true.

Models where k=0 are not ex­actly use­ful to us, though: in these mod­els, the k=0 branches of the proofs are taken, and Tk never achieves self-trust. What we want is to show that there are mod­els of Tk where k is ar­bi­trar­ily large. We can prove this via in­duc­tion.

We’ve already got a base case: {N, k=0} is a model of Tk. For the in­duc­tion step, we need to show that if {N, k=n} is a model of Tk then {N, k=n+1} is a model of Tk. To show this, note:

  1. If {N, k=n+1} ⊧ □Tk[φ] then {N, k=n} ⊧ □Tk[φ]

  2. If {N, k=n} ⊧ □Tk[φ] then {N, k=n} ⊧ φ

  3. If {N, k=n} ⊧ φ then {N, k=n+1} ⊧ φ[k\k-1]

(1) is true be­cause □Tk has no ac­cess to k.

(2) is true be­cause if □Tk[φ] is a proof mod­eled a stan­dard model then φ is se­man­ti­cally deriv­able within Tk by a stan­dard (ie clas­si­cal) proof. When­ever Γ → Δ via a clas­si­cal proof, all mod­els of Γ are also mod­els of Δ. Since {N, k=n} is a model of Tk, it is also a model of all things prov­able in Tk via stan­dard proof.

(3) is triv­ial by sub­sti­tu­tion.

We can ex­tend these points to cover for­mu­las free in one vari­able by not­ing that, be­cause we are deal­ing with N as our model, ev­ery term x cor­re­sponds to a finite con­stant c.

Com­bin­ing these three points we get that if {N, k=n+1} ⊧ □Tk[φ] then {N, k=n+1} ⊧ φ[k\k-1], which com­pletes the in­duc­tion step.

We con­clude that {N, k=n} is a model of Tk for any n.

We can use this to con­struct tiling agents. The proof is tech­ni­cal (there’s a bit of fi­nagling to do be­cause when the child has proven an ac­tion safe for k steps, the par­ent con­cludes that it is safe for k-1 steps). The proof in the pa­per still has a num­ber of ty­pos. (See the end of the post.) I won’t go over the proof in de­tail here. It suffices to say that when the child proves an ac­tion safe for k steps, the par­ent con­cludes that it’s safe for at least k-1 steps and thus (con­di­tional upon k be­ing pos­i­tive) A1 can con­struct A0. Out­side the sys­tem, we know that k can be ar­bi­trar­ily large, and so con­clude that the sys­tem is safe for an ar­bi­trary num­ber of steps.

This sys­tem, like the last, would not be the best choice as a be­lief-sys­tem for a ra­tio­nal agent. It’s not clear what k rep­re­sents to the agent, nor what it would mean for a ra­tio­nal agent to “ac­tu­ally be­lieve” Tk. As be­fore, Tk is not pre­sented as a re­al­is­tic can­di­date for the men­tal sys­tem of a real agent. Rather, it is pre­sented as yet an­other tech­ni­cal vic­tory over the Löb­sta­cle.

With that done, we can talk about self-mod­ify­ing agents on a more gen­eral level, and will do so in the next few sec­tions.

Prob­a­bil­is­tic Reflection

The pa­per men­tions the Prob­a­bil­is­tic Reflec­tion re­sults, sum­ma­rizes them, and notes that we don’t yet have a for­mal­ism for agents us­ing a prob­a­bil­is­tic re­flec­tion prin­ci­ple. (We don’t even yet know what knowl­edge of prob­a­bil­is­tic re­flec­tion can ap­pear in­side a prob­a­bil­is­ti­cally re­flec­tive sys­tem).

A few days ago I did a that write-up of that pa­per, which may be helpful if you’re look­ing for more walk­throughs such as this one.

What Self-Mod­ify­ing Agents Need

There are three prin­ci­ples that we re­ally want our self-mod­ify­ing agents to ad­here to. I’ll sum­ma­rize these sec­tions briefly, as they are not difficult to grasp. See the pa­per for more.

The Vingean principle

Agents shouldn’t have to pre­dict pre­cisely what their chil­dren do. If the par­ent knows pre­cisely what the child will do — for ex­am­ple, by prov­ing which ac­tion the child will take — then the child is hardly ‘smarter’ than the par­ent. In our log­i­cal sys­tems, we fol­low the Vingean prin­ci­ple by re­quiring that the par­ent only dis­cuss the child’s ac­tions within quan­tifiers. We can have a set rep­re­sent­ing the child-ac­tions, and we can quan­tify over that set, but we can’t have par­tic­u­lar child-ac­tions known ahead of time.

The par­ent may ver­ify prop­er­ties about all ac­tions, e.g. by look­ing at the ac­tion sys­tem and ver­ify­ing that the mo­tors only ac­ti­vate when pre­sented with a valid proof. The par­ent may not e.g. simu­late the child’s ex­act ac­tion.

The tiling principle

Agents should be able to ap­prove the con­struc­tion of agents similar to them­selves. This is what we’ve been search­ing for for­mal­isms to al­low in the above sec­tions. This por­tion of the pa­per also dis­cusses a for­mal­ism where the agent is di­vided into an un­trusted proof sug­ges­tion com­po­nent Σ and a trusted proof ver­ifi­ca­tion com­po­nent Δ. This has some nice prop­er­ties (es­pe­cially in mono­tonic logic, where things that Σ knows but isn’t tel­ling Δ can’t cause Δ to al­low some­thing it shouldn’t), but would quickly break down out­side the realm of logic.

The nat­u­ral­is­tic principle

Agents should think of them­selves as part of the world. There shouldn’t be a Carte­sian bound­ary be­tween “world-stuff” and “mind-stuff”. Other­wise, you have to deal with weird bound­ary prob­lems (at­tach­ing a new CPU re­quires things to cross the world/​mind bound­ary).

Fur­ther­more, hav­ing differ­ent rules for in­ter­nal ac­tions and ex­ter­nal ac­tions is not par­tic­u­larly safe. For ex­am­ple, if cer­tain in­ter­nal mod­ifi­ca­tions are pro­hibited an agent could build an ex­ter­nal thing that hap­pens to ap­ply the de­sired in­ter­nal changes.

But if you do al­low a carte­sian bound­ary then the Löb­sta­cle is easy to avoid. You sim­ply set as your crite­rion of ac­tion to be

ψ(ac­tion) =
(ex­ter­nalOK(ac­tion) → (@ac­tion → G)) ∧
(in­ter­nalOk(ac­tion) → (@ac­tion → (∀ ac­tion’ in Acts. @ac­tion’ → □[ψ(ac­tion’)])))

Trans­la­tion: “The con­di­tion for ac­tion is this: [when ac­tion is ex­ter­nalOK, if we ex­e­cute it then it achieves the goal] and [when ac­tion is in­ter­nalOk, if we ex­e­cute it then it leaves us in a state where we only take ac­tions that meet this con­di­tion].”

Note: There is a bit of am­bi­guity here. I be­lieve that ‘Acts’ is meant to re­fer to the new ac­tion-set available af­ter tak­ing the in­ter­nal ac­tion.

This is similar to our first failed solu­tion, ex­cept that the Carte­sian bound­ary al­lows ψ to avoid be­com­ing a Henkin sen­tence.

Note: On gut in­stinct, I’d con­sider ex­plor­ing ψ(x) = @x → (G ∨ (∀y ∈ ActsX. @y → □[ψ(y)])), which is a ver­sion of the above that does not have a Carte­sian bound­ary and does not im­me­di­ately suc­cumb to the Henkin-sen­tence prob­lem.


Then we try to add sen­sors, and ev­ery­thing blows up.

That’s not ex­actly the case, but adding sen­sors and un­known vari­ables cer­tainly doesn’t make the for­mal­ism any pret­tier.

Sec­tion six is long and some­what in­volved, though it’s say­ing some­thing es­sen­tially sim­ple. I’ll sum­ma­rize:

We want to al­low our agents to have sen­sors, and in­deed con­struct new sen­sors. We’ll say things like

A0 ⊩ @sl ⇒ @Sl

to mean “A0 be­lieves @sl only when sen­sor S takes on the state @Sl”.

This causes a trust prob­lem. We still want A1 to trust A0, de­spite the fact that A0 will have ac­cess to sen­sor data that A1 knows not. We can patch our trust frame­work by chang­ing A0′s crite­rion of ac­tion to be

@bi ⇒ A0 ⊩ (@sl ∧ @bi) → (∅ ∨ G)

(If bi is taken then A0 be­lieves that, with the sen­sors in state sl, tak­ing the ac­tion bi leads to ei­ther ∅ or G). Then, if this is known for ev­ery pos­si­ble com­bi­na­tion of sen­sor states, A1 can re­tain trust in A0.

This is messy, though. Ad­di­tional sen­sors lead to ex­po­nen­tially more com­put­ing power re­quired (on A1′s part) to trust A0. This seems patholog­i­cal. What we want is for A1 to rea­son that, since it can ver­ify that A0 ⊩ @sl only when @Sl (i.e. A1 trusts that the sen­sors take on state sl only when the world is ac­tu­ally in state Sl), any­thing that A0 does based on the sen­sor data is trusted.

And we can get this prop­erty — or some­thing like it. How­ever, this means that A1 must be able to ex­press things like “A0′s rea­son­ing is such that, if we add only “true” sen­sor-data, it does not dis­turb our trust in A0.” So A1 needs to not only trust A0, but must also trust A0 af­ter A0 has had new “true ax­ioms” added in the form of (trusted) sen­sor data.

For­mal­iz­ing this re­quires full-on set the­ory, which can for­mal­ize things like “In all mod­els X, all log­i­cal tau­tolo­gies hold”.

We then cre­ate a lan­guage in which there is a spe­cial set τ (the “ter­ri­tory”) en­cod­ing a model of the en­vi­ron­ment. We then ex­press A1′s trust as fol­lows:

A1 ⊩ @A0 → ∀[φ]. (τ ⊧τ [A0 ⊩ “φ”]) → (τ ⊧τ [φ])

Trans­la­tion: A1 be­lieves that, if it con­structs A0, then for all quoted sen­tences about the ter­ri­tory, if it is true in the ter­ri­tory that A0 be­lieves some A0-en­cod­ing of φ, then it is ac­tu­ally true in the ter­ri­tory that φ.

No­tice the two lev­els of quot­ing. A1 be­lieves that A0′s be­liefs (“φ”) cor­re­spond to state­ments true in the ter­ri­tory ([φ]).

This loosely rep­re­sents how a pro­gram­mer might rea­son about a pro­gram: A pro­gram­mer (A1), who thinks that there is an ob­ject (φ) in the en­vi­ron­ment, will have an in­ter­nal rep­re­sen­ta­tion ([φ]) of the ob­ject. They will also know that the pro­gram (A2) has an ob­ject cor­re­spond­ing to Fred (φ), and they will re­tain a men­tal map be­tween their rep­re­sen­ta­tion of the ob­ject ([φ]) and their rep­re­sen­ta­tion of the pro­gram’s rep­re­sen­ta­tion of the ob­ject ([“φ”]).

The point is that we de­ploy set the­ory to rep­re­sent this sort of logic. This logic, how­ever, al­lows us to rea­son about sen­sors thusly (re­plac­ing ⊧τ with ⊧):

A1 ⊩ @A0 → ∀[sl]. (τ ⊧ [A0 ⊩ “@sl”]) → (τ ⊧ [@Sl])

Which ba­si­cally says “A0 only be­lieves @sl when, in re­al­ity, the sen­sor ac­tu­ally is in state l.”

This is what we wanted, but it does not come cheap: We had to de­ploy full ZF in or­der to al­low A1 to rea­son as we wanted. Worse, A2 (the par­ent of A1) must rea­son in ZF+1 in or­der to trust A1 (be­cause it must be­lieve that there is some model of ZF: oth­er­wise A1 is mean­ingless). A3 must rea­son is ZF+2, and so on.

It is noted that this can prob­a­bly be ad­dressed via para­met­ric poly­mor­phism (ie A2 uses ZFk, and also trusts A1 to use ZFk), though no proof has been writ­ten out yet. A sim­plifi­ca­tion of the above mechanism (with­out re­sort­ing to set the­ory, or at the least us­ing a weaker set the­ory) is de­sired.

Prob­a­bil­ity and Ex­pected Utility

Ad­ding ex­pected util­ity satis­fac­tion is some­what difficult.

Prob­a­bil­is­tic rea­son­ing is non-mono­tonic. This leads to difficul­ties. For ex­am­ple, if you have an un­trusted proof-sug­gester Σ and a proof-ver­ifier Δ, then so long as Δ works you can’t go wrong — in logic. At worst, Σ with­holds a good proof and the sys­tem does noth­ing. But with prob­a­bil­is­tic rea­son­ing, Σ can with­hold in­for­ma­tion from Δ with catas­trophic re­sults.

For ex­am­ple, imag­ine that the con­di­tion of ac­tion is E[U] ≥ $20, where E[U] is the ex­pected util­ity. If we have a fair coin and a bet with pay­out of $50 on H and $10 on T, then the agent will take the bet. But if we later learn that the coin landed T, we should no longer take the bet: new in­for­ma­tion can in­val­i­date ac­tions that oth­er­wise would have been taken. If the sug­ges­tion-de­vice is un­trusted, this can be par­tic­u­larly dan­ger­ous, be­cause Σ can with­hold in­for­ma­tion mal­i­ciously.

We can hack our way around this prob­lem by re­quiring that ev­ery prob­a­bil­is­tic vari­able be ob­served by A0 be­fore it may choose an ac­tion. A1, know­ing this, can pro­ceed to trust A0. A for­mal­iza­tion of this pro­cess is given on pages 35-36. The re­sult is a tech­ni­cal suc­cess — the par­ent can still trust the child so long as the child ob­serves all data be­fore act­ing — but this for­mal­ism falls short on many counts.

For ex­am­ple, the given for­mal­ism can­not as­sign prior prob­a­bil­ities to vari­ables. (In the given for­mal­ism, prob­a­bil­is­tic vari­ables are left unas­signed un­til the ob­ser­va­tions comes in. If the sys­tem tried to as­sign ν=0.7 as a prior prob­a­bil­ity and later as­signed ν=1 ac­cord­ing to sen­sor data, the sys­tem would ar­rive at the con­tra­dic­tory 0.7=1.)

This is ob­vi­ously a very rudi­men­tary for­mal­iza­tion of util­ity satis­fac­tion, and there is much work to be done.

Utility Management

The above sec­tion dealt with util­ity satis­fac­tion, not util­ity max­i­miza­tion. This was in­ten­tional, as perfect util­ity max­i­miza­tion is im­pos­si­ble in all but the most triv­ial of situ­a­tions.

Utility satis­fac­tion, how­ever, is far from ideal. To take an ex­am­ple from the pa­per, con­sider an agent built to satisfy at least E[U]=2*10^15. If it finds an ac­tion that guaran­tees E[U]=10^16, then the agent may de­cide to ran­dom­ize be­tween a 30% prob­a­bil­ity of tak­ing the 10^16, a 60% prob­a­bil­ity of do­ing noth­ing, and a 10% prob­a­bil­ity of burn­ing 10^16 utils. This is… not ideal, to say the least.

You get what you pay for, no more than you asked for, and ex­actly what you proved.

An al­ter­na­tive means of util­ity man­age­ment is a “me­lioriz­ing” crite­ria of ac­tion, wherein an agent picks an ac­tion and then de­cides whether to swap its cur­rent ac­tion for an­other as op­por­tu­ni­ties arise. This strat­egy is not with­out its flaws. To lift an­other ex­am­ple from the pa­per, imag­ine an agent built to save hu­man­ity from a deadly as­ter­oid. Imag­ine it starts with the de­fault ac­tion of “do noth­ing”. If the first ac­tion it comes across is “im­me­di­ately take an ac­tion which will save 1000 hu­mans, then shut down”, it will take that ac­tion and then shut down — for­go­ing the op­por­tu­nity to save billions of hu­man lives. Also not ideal.

That said, me­lioriza­tion at least shows a bit of promise, whereas satis­fic­ing is un­ten­able and max­i­miza­tion is un­re­al­is­tic. It’s worth not­ing, though, that we don’t even un­der­stand — not even at the high­est pos­si­ble level — how a prob­a­bil­is­tic ra­tio­nal agent should go about pur­su­ing its goals. This is the pa­per’s end­ing note:

The prob­lem is very wide open in­deed.



I’m pretty con­fi­dent about most of these, but some may be due to a mi­s­un­der­stand­ing, and are suffixed with (?).

  • p10. ψ ↔ φ(x) ∨ □τ[ψ(x)], this should be ψ(x) ↔ … (?)

  • p17. (4.13) The lead­ing A_t^0 has a stray ‘t’ af­ter it.

  • p17. (4.14) is miss­ing a b_{i,u}\bar in­side the □_0, be­fore the “→ ∀υ”.

  • p17. (4.14) is miss­ing a b_{i,t}\bar be­fore the “→ ∀ν”.

  • p18. (4.23) there is a K that should be a κ.

  • p18. (4.24) there is an ex­tra­ne­ous (A_t^0 →) af­ter the ⊩

  • p18. (4.24) there should be a −1 af­ter ∀ν≤υ+κ. (Alter­na­tively, change the ≤ to <) (?)

  • p18. (4.25) there is an ex­tra­ne­ous (A_t^0 →) af­ter the ⊩

  • p18. (4.25) there should be a −1 af­ter ∀υ≤t+κ. (Alter­na­tively, change the ≤ to <) (?)

  • p18. (4.26) there should be a −1 af­ter ∀υ≤t+κ. (Alter­na­tively, change the ≤ to <) (?)

  • p28. “If we were in the shoes of A1 build­ing A1 …” should be “… build­ing A0 …”.

  • p29. “… it should not dis­turb our trust in A1.” should be “… in A0.”

  • p31. (6.9) (τ ⊧ @A0) should be just @A0 (?)


  • p10. “Either x im­plies (∅ ∨ G), or …” should read “Either the ex­e­cu­tion of x im­plies (∅ or G), or …”

  • p13. You use ≡ to in­tro­duce ψ and ≜ to in­tro­duce T-n. Later (p27) you use = to in­tro­duce a differ­ent ψ. Are these differ­ences in­ten­tional?

  • p19. Us­ing P() to de­note the func­tion P seems some­what strange. The prob­a­bil­is­tic re­flec­tion pa­per sim­ply calls it P. Other texts I’ve read use the style P(—). I haven’t seen the style P() be­fore. (This may just be in­ex­pe­rience.)

  • In in­finite de­scent, why do the the­o­ries walk down­wards? Why not use T0, T1, T2, …? Walk­ing down­wards causes a bit of awk­ward syn­tax (T-(n+1)) that seems un­nec­es­sary.

And while I’m on the sub­ject, it felt for­eign that A0 is the youngest agent. I would ex­pect A0 to be the old­est agent. In one sense, I ap­pre­ci­ated that A0 was the youngling: it lends it­self well to the idea that we have this gi­gan­tic tower of agents build­ing agents build­ing agents, and then at the very end of the tower we have the fi­nal agent that af­fects the en­vi­ron­ment (which is why A0 is spe­cial). How­ever, one of the nice things about this for­mal­ism is that it doesn’t re­ally treat en­vi­ron­ment-ac­tions differ­ently from agent-cre­at­ing-ac­tions, so we don’t need this dis­tinc­tion. I found it counter-in­tu­itive to con­sider the “end” of the tiling schema, prefer­ring to think of it from the other di­rec­tion: we start with an agent run­ning on trusted math (agent zero) and let it tile in­definitely.

All in all, it’s not a par­tic­u­larly bad choice, it just felt un­nat­u­ral to me.


Sec­tion 5.2 seems like a non-se­quitur. 5.1 and 5.3 dis­cuss prin­ci­ples that agents should have. By con­trast, 5.2 men­tions tiling and gives ex­am­ples of how trust can be de­com­posed. It was weird to be in­tro­duced to tiling af­ter spend­ing sec­tion 4 prov­ing that tiling is pos­si­ble while pre­serv­ing trust. Also, the dis­cus­sion about de­com­po­si­tion of trust didn’t feel at home in a sec­tion de­tailing the tiling prin­ci­ple.

I’d recom­mend mov­ing sec­tion 5 up to be­fore sec­tion 3 or pos­si­bly be­fore sec­tion 2. This would give us a bet­ter idea of what we’re look­ing for in self-mod­ify­ing agents be­fore we are in­tro­duced to the ob­sta­cles. I’d also sug­gest break­ing “de­com­po­si­tion of trust” out of the “tiling prin­ci­ple” sec­tion. 5.2 feels like it’s try­ing to in­tro­duce one con­cept while ex­plain­ing an­other.


  • p17. Syn­tax ques­tion: Why are (4.15) and (4.17) in­dented more than ev­ery­thing else?

  • p27. I’m con­fused about the “split de­ci­sion crite­rion”. Acts de­notes the set of ac­tions that are available af­ter the ex­e­cu­tion of the (in­ter­nalOk) ac­tion act, yes?