# Troll Bridge

All of the re­sults in this post, and most of the in­for­mal ob­ser­va­tions/​in­ter­pre­ta­tions, are due to Sam Eisen­stat. I think the Troll Bridge story, as a way to make the de­ci­sion prob­lem un­der­stand­able, is due to Tsvi; but I’m not sure.

## Pure Logic Version

Troll Bridge is a de­ci­sion prob­lem which has been float­ing around for a while, but which has lacked a good in­tro­duc­tory post. The origi­nal post gives the es­sen­tial ex­am­ple, but it lacks the “troll bridge” story, which (1) makes it hard to un­der­stand, since it is just stated in math­e­mat­i­cal ab­strac­tion, and (2) makes it difficult to find if you search for “troll bridge”.

The ba­sic idea is that you want to cross a bridge. How­ever, there is a troll who will blow up the bridge with you on it, if (and only if) you cross it “for a dumb rea­son” — for ex­am­ple, due to un­sound logic. You can get to where you want to go by a worse path (through the stream). This path is bet­ter than be­ing blown up, though.

We ap­ply a Löbian proof to show not only that you choose not to cross, but fur­ther­more, that your coun­ter­fac­tual rea­son­ing is con­fi­dent that the bridge would have blown up if you had crossed. This is sup­posed to be a coun­terex­am­ple to var­i­ous pro­posed no­tions of coun­ter­fac­tual, and for var­i­ous pro­posed de­ci­sion the­o­ries.

The pseu­docode for the en­vi­ron­ment (more speci­fi­cally, the util­ity gained from the en­vi­ron­ment) is as fol­lows:
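
A minimal Python sketch of that utility function (the `pa_proves_inconsistency` check stands in for the provability claim □⊥ and is of course not actually computable; the placeholder return value just keeps the sketch runnable):

```python
def pa_proves_inconsistency():
    # Stand-in for the claim "PA proves a contradiction" (written as box-bottom,
    # i.e. U+25A1 followed by falsum, in the text); there is no real decision
    # procedure for this, so we return a placeholder.
    return False

def environment_utility(action):
    # The troll blows up the bridge (U = -10) if the agent crosses while PA is
    # inconsistent; an ordinary crossing is worth +10; the stream path is worth 0.
    if action == "cross" and pa_proves_inconsistency():
        return -10
    elif action == "cross":
        return +10
    else:
        return 0
```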

IE, if the agent crosses the bridge and is in­con­sis­tent, then U=-10. (□⊥ means “PA proves an in­con­sis­tency”.) Other­wise, if the agent crosses the bridge, U=+10. If nei­ther of these (IE, the agent does not cross the bridge), U=0.

The pseu­docode for the agent could be as fol­lows:
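
A minimal Python sketch of such an agent, assuming a search over the three possible utilities from best to worst and a "not cross" default clause (`pa_proves` stands in for an unbounded proof search; the placeholder return value just keeps the sketch runnable):

```python
def pa_proves(statement):
    # Stand-in for searching for a PA proof of `statement`.
    return False

def agent():
    # Check each "action implies utility" claim, best utility first, and take
    # the first action whose utility claim is provable.
    for u in (10, 0, -10):
        for a in ("cross", "not cross"):
            if pa_proves(f"A() = {a} implies U = {u}"):
                return a
    return "not cross"  # the "else" clause; its choice matters, as discussed below
```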

This is a lit­tle more com­pli­cated, but the idea is sup­posed to be that you search for ev­ery “ac­tion im­plies util­ity” pair, and take the ac­tion for which you can prove the high­est util­ity (with some tie-break­ing pro­ce­dure). Im­por­tantly, this is the kind of proof-based de­ci­sion the­ory which elimi­nates spu­ri­ous coun­ter­fac­tu­als in 5-and-10 type prob­lems. It isn’t that easy to trip up with Löbian proofs. (His­tor­i­cal/​ter­minolog­i­cal note: This de­ci­sion the­ory was ini­tially called MUDT, and is still some­times referred to in that way. How­ever, I now of­ten call it proof-based de­ci­sion the­ory, be­cause it isn’t cen­trally a UDT. “Mo­dal DT” (MDT) would be rea­son­able, but the modal op­er­a­tor in­volved is the “prov­abil­ity” op­er­a­tor, so “proof-based DT” seems more di­rect.)

Now, the proof:

• Rea­son­ing within PA (ie, the logic of the agent):

• Sup­pose the agent crosses.

• Fur­ther sup­pose that the agent proves that cross­ing im­plies U=-10.

• Ex­am­in­ing the source code of the agent, be­cause we’re as­sum­ing the agent crosses, ei­ther PA proved that cross­ing im­plies U=+10, or it proved that cross­ing im­plies U=0.

• So, ei­ther way, PA is in­con­sis­tent—by way of 0=-10 or +10=-10.

• So the troll ac­tu­ally blows up the bridge, and re­ally, U=-10.

• There­fore (pop­ping out of the sec­ond as­sump­tion), if the agent proves that cross­ing im­plies U=-10, then in fact cross­ing im­plies U=-10.

• By Löb’s the­o­rem, cross­ing re­ally im­plies U=-10.

• So (since we’re still un­der the as­sump­tion that the agent crosses), U=-10.

• So (pop­ping out of the as­sump­tion that the agent crosses), the agent cross­ing im­plies U=-10.

• Since we proved all of this in PA, the agent proves it, and proves no bet­ter util­ity in ad­di­tion (un­less PA is truly in­con­sis­tent). On the other hand, it will prove that not cross­ing gives it a safe U=0. So it will in fact not cross.

The para­dox­i­cal as­pect of this ex­am­ple is not that the agent doesn’t cross—it makes sense that a proof-based agent can’t cross a bridge whose safety is de­pen­dent on the agent’s own logic be­ing con­sis­tent, since proof-based agents can’t know whether their logic is con­sis­tent. Rather, the point is that the agent’s “coun­ter­fac­tual” rea­son­ing looks crazy. (How­ever, keep read­ing for a ver­sion of the ar­gu­ment where it does make the agent take the wrong ac­tion.) Ar­guably, the agent should be un­cer­tain of what hap­pens if it crosses the bridge, rather than cer­tain that the bridge would blow up. Fur­ther­more, the agent is rea­son­ing as if it can con­trol whether PA is con­sis­tent, which is ar­guably wrong.

In a com­ment, Stu­art points out that this rea­son­ing seems highly de­pen­dent on the code of the agent; the “else” clause could be differ­ent, and the ar­gu­ment falls apart. I think the ar­gu­ment keeps its force:

• On the one hand, it’s still very con­cern­ing if the sen­si­bil­ity of the agent de­pends greatly on which ac­tion it performs in the “else” case.

• On the other hand, we can mod­ify the troll’s be­hav­ior to match the mod­ified agent. The gen­eral rule is that the troll blows up the bridge if the agent would cross for a “dumb rea­son”—the agent then con­cludes that the bridge would be blown up if it crossed. I can no longer com­plain that the agent rea­sons as if it were con­trol­ling the con­sis­tency of PA, but I can still com­plain that the agent thinks an ac­tion is bad be­cause that ac­tion in­di­cates its own in­san­ity, due to a trou­blingly cir­cu­lar ar­gu­ment.

## Anal­ogy to Smok­ing Lesion

One in­ter­pre­ta­tion of this thought-ex­per­i­ment is that it shows proof-based de­ci­sion the­ory to be es­sen­tially a ver­sion of EDT, in that it has EDT-like be­hav­ior for Smok­ing Le­sion. The anal­ogy to Smok­ing Le­sion is rel­a­tively strong:

• An agent is at risk of hav­ing a sig­nifi­cant in­ter­nal is­sue. (In Smok­ing Le­sion, it’s a med­i­cal is­sue. In Troll Bridge, it is log­i­cal in­con­sis­tency.)

• The in­ter­nal is­sue would bias the agent to­ward a par­tic­u­lar ac­tion. (In Smok­ing Le­sion, the agent smokes. In Troll Bridge, an in­con­sis­tent agent crosses the bridge.)

• The in­ter­nal is­sue also causes some imag­ined prac­ti­cal prob­lem for the agent. (In Smok­ing Le­sion, the le­sion makes one more likely to get can­cer. In Troll Bridge, the in­con­sis­tency would make the troll blow up the bridge.)

• There is a chain of rea­son­ing which com­bines these facts to stop the agent from tak­ing the ac­tion. (In smok­ing le­sion, EDT re­fuses to smoke due to the cor­re­la­tion with can­cer. In Troll Bridge, the proof-based agent re­fuses to cross the bridge be­cause of a Löbian proof that cross­ing the bridge leads to dis­aster.)

• We in­tu­itively find the con­clu­sion non­sen­si­cal. (It seems the EDT agent should smoke; it seems the proof-based agent should not ex­pect the bridge to ex­plode.)

In­deed, the anal­ogy to smok­ing le­sion seems to strengthen the fi­nal point—that the coun­ter­fac­tual rea­son­ing is wrong.

I’ve come to think of Troll Bridge as “the real smok­ing le­sion”, since I’m gen­er­ally not satis­fied with how smok­ing le­sion is set up.

But is proof-based de­ci­sion the­ory re­ally a ver­sion of EDT? I think there’s more to say about the anal­ogy, but a sim­ple ex­pla­na­tion is this: both EDT and proof-based de­ci­sion the­ory eval­u­ate ac­tions by adding them to the knowl­edge base and see­ing what the world looks like un­der that ad­di­tional as­sump­tion. Or, to put it differ­ently, proof-based DT rea­sons about ac­tions as if they’re ob­ser­va­tions. That’s the fun­da­men­tal idea of ev­i­den­tial de­ci­sion the­ory.

## Prob­a­bil­is­tic Version

For the purely log­i­cal ver­sion, I said that we can’t fault the con­clu­sion (be­cause the agent can’t prove that it is safe to cross the bridge) – only the rea­son­ing is be­ing cri­tiqued. How­ever, the prob­a­bil­is­tic ver­sion bet­ter demon­strates the sever­ity of the rea­son­ing er­ror; it illus­trates an EDT agent mak­ing what seems to be an out­right mis­take.

It’s reasonable to avoid crossing the bridge if you’re only thinking about provable results. But we might expect that an agent who balances risks and rewards would decide to cross, so long as it believes the probability of PA being inconsistent is low. This turns out not to be the case.

Here’s the setup.

Sup­pose that the agent uses a prob­a­bil­ity dis­tri­bu­tion which “re­spects logic” in the sense that it as­signs prob­a­bil­ity zero to any­thing which is log­i­cally re­futable. (This as­sumes log­i­cal om­ni­science. I’ll drop that as­sump­tion later.)

The agent is as fol­lows: if P(cross)=0, cross; if P(¬cross)=0, don’t cross; else, take the ac­tion with high­est ex­pected util­ity, break­ing ties by not cross­ing. The idea is that we are try­ing to take the high­est-ex­pected-util­ity ac­tion; how­ever, we must han­dle cases where this isn’t well-defined. Choos­ing to take an ac­tion when the prob­a­bil­ity of the ac­tion is zero guaran­tees that the agent can never prove its own ac­tion; it is a ver­sion of the chicken rule. Break­ing ties by not cross­ing is ar­bi­trary, but gives the sim­plest ver­sion of the ex­am­ple, be­cause this choice re­duces the num­ber of cases we need to check.

The troll is as fol­lows: if the agent crosses the bridge due to the P(cross)=0 clause, blow up the bridge.
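
A sketch of this agent and troll, where `P` and `EU` stand in for the agent's logic-respecting probabilities and its conditional expected utilities (both are assumed, not defined here):

```python
def prob_agent(P, EU):
    # Chicken-rule clauses: if an action has probability zero, take it.
    if P("cross") == 0:
        return "cross"
    if P("not cross") == 0:
        return "not cross"
    # Otherwise maximize expected utility, breaking ties by not crossing.
    return "cross" if EU("cross") > EU("not cross") else "not cross"

def troll_blows_up(crossed_via_chicken_clause):
    # The troll blows up the bridge exactly when the agent crossed because of
    # the P(cross)=0 clause.
    return crossed_via_chicken_clause
```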

Rea­son­ing in the logic of the agent:

• Suppose □(A=cross → U=-10).

• Further suppose A=cross.

• Examining the agent, either crossing had higher expected utility, or P(cross)=0. But we assumed □(A=cross → U=-10); since the agent’s beliefs respect proofs, crossing could not have had higher expected utility than not crossing (which is worth 0), so it must be the latter. So the bridge gets blown up.

• Therefore (popping out of the second supposition), A=cross → U=-10.

• Therefore (popping out of the first supposition), □(A=cross → U=-10) → (A=cross → U=-10).

Thus, by Löb’s theorem, we see that A=cross → U=-10 is provable.

There­fore, since the agent is cer­tain of things which are prov­able, the agent as­signs ex­pected value −10 to cross­ing. Since the ex­pected value of the other ac­tion will always be higher than this, and since we know the agent doesn’t ever re­ally en­ter into the P=0 con­di­tions, the agent will choose not to cross.

No­tice that this rea­son­ing did not de­pend much on the val­ues 10, 0, and −10. The util­ity of the bridge get­ting blown up could be −0.000001, and the agent still won’t cross. It isn’t weigh­ing the risk; it’s de­cided that the worst out­come is in­evitable. In the case of proof-based agents, I said that the over­all de­ci­sion not to cross was un­der­stand­able, since proof-based agents are un­able to weigh the risks. A prob­a­bil­is­tic agent, how­ever, should in­tu­itively be able to say “I don’t know whether the bridge will get blown up, be­cause it in­volves rea­son­ing about prop­er­ties of my­self which I’m fun­da­men­tally un­cer­tain about; but, the odds look de­cent.” But that’s not what hap­pens: in­stead, it is sure that cross­ing is un­fa­vor­able, no mat­ter what over­all prob­a­bil­ity it as­signs to P(A=cross)=0.

So, in this case we con­clude that the Troll Bridge ex­am­ple re­sults in a chicken-rule-based agent tak­ing the wrong ac­tion over­all. The agent shouldn’t be sure that it would cross “for the right rea­son” (it should as­sign some prob­a­bil­ity to P(A=cross)=0, since it can’t know that its own logic is con­sis­tent). How­ever, in­tu­itively, it should be able to as­sign some prob­a­bil­ity to this, and bal­ance the risks. If the down­side risk is U=-0.000001, and the prob­a­bil­ity it as­signs to its logic be­ing con­sis­tent is not similarly small, it should cross—and in do­ing so, it would get +10.
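
To make the intuition concrete, here is the naive expected-value calculation the agent "should" be able to make, with an illustrative (assumed) 0.999 credence in its own consistency and the tiny downside from the example above:

```python
p_consistent = 0.999        # illustrative credence that PA is consistent
u_blown_up   = -0.000001    # downside used in the example above
u_safe_cross = 10
u_stay       = 0

ev_cross = p_consistent * u_safe_cross + (1 - p_consistent) * u_blown_up
print(ev_cross > u_stay)    # True: the naive calculation says crossing is worth the risk
```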

As men­tioned for the proof-based agent, the agent’s code is a bit ar­bi­trary, and it is worth ask­ing how im­por­tant the de­tails were. In par­tic­u­lar, the de­fault in the case of a tie was to not cross. What if the de­fault in case of a tie were to cross?

We then mod­ify the troll’s al­gorithm to blow up the bridge if and only if P(A=cross)=0 or there is a tie. The proof then goes through in the same way.

Per­haps you think that the prob­lem with the above ver­sion is that I as­sumed log­i­cal om­ni­science. It is un­re­al­is­tic to sup­pose that agents have be­liefs which perfectly re­spect logic. (Un)For­tu­nately, the ar­gu­ment doesn’t re­ally de­pend on this; it only re­quires that the agent re­spects proofs which it can see, and even­tu­ally sees the Löbian proof refer­enced.

## Ran­dom Exploration

The frus­trat­ing thing about Troll Bridge is that it seems like the agent could just cross the bridge, and things would be fine. The proof that things wouldn’t be fine re­lies on the fact that the agent ac­cepts that very proof as suffi­cient rea­son; so can’t we just ig­nore that kind of proof some­how?

One thing you might try is to con­sider a learn­ing agent, and force ran­dom ex­plo­ra­tion so the agent just crosses the bridge some­times. If the agent crosses the bridge, it should be able to see that it’s safe, right?

How­ever, we have to ask: what’s the ap­pro­pri­ate ver­sion of Troll Bridge for the ex­plor­ing agent? Re­mem­ber I said that the ba­sic idea of Troll Bridge is that the troll blows up the bridge if the agent crosses “for a dumb rea­son”—ie, for a rea­son other than “cross­ing is a good idea”. Ran­dom ex­plo­ra­tion falls into this cat­e­gory.

Fur­ther­more, ran­dom ex­plo­ra­tion is re­ally closely analo­gous to the chicken rule. Con­sider:

• (A) The first chicken rule I men­tioned: if you prove that you don’t take an ac­tion, you im­me­di­ately take it.

• (B) The sec­ond chicken rule men­tioned: if P(A)=0, then do A.

• (C) Ep­silon-chicken: if P(A)<ep­silon, then do A.

• (D) Ep­silon-ex­plo­ra­tion: ran­domly do A with prob­a­bil­ity ep­silon.

(A) is very similar to (B) if the prob­a­bil­ity dis­tri­bu­tion re­spects proofs in the way I men­tioned pre­vi­ously; if the prob­a­bil­ity dis­tri­bu­tion has the fur­ther prop­erty of as­sign­ing pos­i­tive prob­a­bil­ity to any­thing not dis­proven, then they’re ex­actly the same. (C) is a nat­u­ral gen­er­al­iza­tion of (B). (B) tells us what to do when the EDT de­ci­sion rule isn’t well-defined (we can’t con­di­tion on prob­a­bil­ity zero events). (C) gen­er­al­izes this by treat­ing small-prob­a­bil­ity events in the same way (we might ex­pect our con­di­tional be­liefs to be bad for small-prob­a­bil­ity events, since we pre­sum­ably don’t have much ex­pe­rience with them; so we treat it as if it’s the un­defined case). (B) en­sures there are no zero-prob­a­bil­ity ac­tions, which means we don’t ac­tu­ally ever have to deal with the EDT de­ci­sion rule be­ing non-well-defined; similarly, (C) en­sures no ac­tion is al­most surely never taken, which helps en­sure that our con­di­tional ex­pec­ta­tions for all ac­tions are em­piri­cally grounded.

(C) and (D) are very similar; (C) just supplies pseudorandomness via the agent’s own probabilities. Whenever the agent is too confident about what it does, it does something else. This makes the randomization unpredictable to the agent. (D) supplies randomness from an external source instead.
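
A sketch of rules (C) and (D) side by side, to make the comparison concrete (`P`, `EU`, and the action list are stand-ins, not defined here):

```python
import random

def epsilon_chicken(P, EU, actions, epsilon=0.01):
    # (C): if the agent is too confident it won't take some action, take it;
    # the "randomization" comes from the agent's own probabilities.
    for a in actions:
        if P(a) < epsilon:
            return a
    return max(actions, key=EU)

def epsilon_explore(P, EU, actions, epsilon=0.01):
    # (D): with probability epsilon, act randomly using an external randomness source.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=EU)
```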

So we have a chain con­nect­ing the chicken rule to ran­dom ex­plo­ra­tion.

(Another way of think­ing about the similar­ity: The chicken rule en­sures that there are mod­els of the agent’s be­liefs in which each ac­tion is taken, so that the agent can have sen­si­ble log­i­cal rea­son­ing about each al­ter­na­tive. Ep­silon-ex­plo­ra­tion en­sures that the em­piri­cal ex­pe­riences of the agent in­clude all of the ac­tions, so that the agent can have rea­son­able em­piri­cal es­ti­mates of con­se­quences of each ac­tion.)

OK. So Troll Bridge for an ex­plor­ing agent should blow up the bridge on ex­plo­ra­tion, or if PA is in­con­sis­tent. An agent could look like this:

• With small prob­a­bil­ity, ex­plore.

• If ex­plor­ing, flip a coin to cross or not cross.

• If P(cross)=0, cross.

• If P(¬cross)=0, don’t cross.

• Other­wise, take the max-ex­pected-value ac­tion, break­ing ties ran­domly.
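
Putting those clauses together, a sketch of the exploring agent (again with `P` and `EU` as stand-ins for the agent's probabilities and expected utilities):

```python
import random

def exploring_agent(P, EU, epsilon=0.01):
    actions = ["cross", "not cross"]
    # Exploration clause: with small probability, flip a coin.
    if random.random() < epsilon:
        return random.choice(actions)
    # Chicken-rule clauses, as before.
    if P("cross") == 0:
        return "cross"
    if P("not cross") == 0:
        return "not cross"
    # Otherwise take the max-expected-value action, breaking ties randomly.
    ev = {a: EU(a) for a in actions}
    best = max(ev.values())
    return random.choice([a for a in actions if ev[a] == best])
```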

I’m still assuming that the agent’s probability distribution respects proofs, as before. I’m also assuming this agent is playing the game repeatedly, and learning. I also must assume that the agent has found the proofs referred to in the reasoning below. Now, the agent reasons:

• Suppose □(cross → u=-10) for a particular round.

• Fur­ther sup­pose I crossed on that round.

• By the first sup­po­si­tion, I knew the pay­out of cross­ing to be low; and I must also have known that the pay­out of not cross­ing is higher, since I can prove that. Since I can prove what both pay­outs are, the ex­pected val­ues must equal those, un­less PA is in­con­sis­tent (in which case P(cross)=0 any­way, since my be­liefs re­spect proofs). So I can only be cross­ing the bridge for two rea­sons—ei­ther this is an ex­plo­ra­tion round, or P(cross)=0.

• In ei­ther case, cross­ing the bridge yields pay­out u=-10.

• Therefore, cross → u=-10 in fact.

• So □(cross → u=-10) → (cross → u=-10).

Since the agent proves that a proof of cross­ing be­ing bad im­plies cross­ing is ac­tu­ally bad, the agent fur­ther must prove that cross­ing is bad in fact, by Löb.

I did this for the log­i­cally om­ni­scient case again, but as be­fore, I claim that you can trans­late the above proof to work in the case that the agent’s be­liefs re­spect proofs it can find. That’s maybe a bit weird, though, be­cause it in­volves a Bayesian agent up­dat­ing on log­i­cal proofs; we know this isn’t a par­tic­u­larly good way of han­dling log­i­cal un­cer­tainty.

We can use logical induction instead, using an epsilon-exploring version of LIDT. We consider LIDT on a sequence of troll-bridge problems, and show that it eventually notices the Löbian proof and starts refusing to cross. This is even more frustrating than the previous examples, because LIDT might successfully cross for a long time, apparently learning that crossing is safe and reliably getting the +10 payoff. Then, one day, it finds the Löbian proof and stops crossing the bridge!

That case is a lit­tle more com­pli­cated to work out than the Bayesian prob­a­bil­ity case, and I omit the proof here.

## Non-ex­am­ples: RL

On the other hand, con­sider an agent which uses ran­dom ex­plo­ra­tion but doesn’t do any log­i­cal rea­son­ing, like a typ­i­cal RL agent. Such an agent doesn’t need any chicken rule, since it doesn’t care about proofs of what it’ll do. It still needs to ex­plore, though. So the troll can blow up the bridge when­ever the RL agent crosses due to ex­plo­ra­tion.

This ob­vi­ously messes with the RL agent’s abil­ity to learn to cross the bridge. The RL agent might never learn to cross, since ev­ery time it tries it, it looks bad. So this is sort of similar to Troll Bridge.

How­ever, I think this isn’t re­ally the point of Troll Bridge. The key differ­ence is this: the RL agent can get past the bridge if its prior ex­pec­ta­tion that cross­ing is a good idea is high enough. It just starts out cross­ing, and hap­pily crosses all the time.

Troll Bridge is about the in­evitable con­fi­dence that cross­ing the bridge is bad. We would be fine if an agent de­cided not to cross be­cause it as­signed high prob­a­bil­ity to PA be­ing in­con­sis­tent. The RL ex­am­ple seems similar in that it de­pends on the agent’s prior.

We could try to al­ter the ex­am­ple to get that kind of in­evita­bil­ity. Maybe we ar­gue it’s still “dumb” to cross only be­cause you start with a high prior prob­a­bil­ity of it be­ing good. Have the troll pun­ish cross­ing un­less the cross­ing is jus­tified by an em­piri­cal his­tory of cross­ing be­ing good. Then RL agents do poorly no mat­ter what—no one can get the good out­come in or­der to build up the his­tory, since get­ting the good out­come re­quires the his­tory.

But this still doesn’t seem so in­ter­est­ing. You’re just mess­ing with these agents. It isn’t illus­trat­ing the de­gree of patholog­i­cal rea­son­ing which the Löbian proof illus­trates—of course you don’t put your hand in the fire if you get burned ev­ery sin­gle time you try it. There’s noth­ing wrong with the way the RL agent is re­act­ing!

So, Troll Bridge seems to be more ex­clu­sively about agents who do rea­son log­i­cally.

## Conclusions

All of the ex­am­ples have de­pended on a ver­sion of the chicken rule. This leaves us with a fas­ci­nat­ing catch-22:

• We need the chicken rule to avoid spu­ri­ous proofs. As a re­minder: spu­ri­ous proofs are cases where an agent would re­ject an ac­tion if it could prove that it would not take that ac­tion. Th­ese ac­tions can then be re­jected by an ap­pli­ca­tion of Löb’s the­o­rem. The chicken rule avoids this prob­lem by en­sur­ing that agents can­not know their own ac­tions, since if they did then they’d take a differ­ent ac­tion from the one they know they’ll take (and they know this, con­di­tional on their logic be­ing con­sis­tent).

• How­ever, Troll Bridge shows that the chicken rule can lead to an­other kind of prob­le­matic Löbian proof.

So, we might take Troll Bridge to show that the chicken rule does not achieve its goal, and there­fore re­ject the chicken rule. How­ever, this con­clu­sion is very se­vere. We can­not sim­ply drop the chicken rule and open the gates to the (much more com­mon!) spu­ri­ous proofs. We would need an al­to­gether differ­ent way of re­ject­ing the spu­ri­ous proofs; per­haps a full ac­count of log­i­cal coun­ter­fac­tu­als.

Furthermore, it is possible to come up with variants of Troll Bridge which counter some such proposals. In particular, Troll Bridge was originally invented to counter proof-length counterfactuals (which essentially generalize chicken rules, and therefore lead to the same Troll Bridge problems).

Another pos­si­ble con­clu­sion could be that Troll Bridge is sim­ply too hard, and we need to ac­cept that agents will be vuln­er­a­ble to this kind of rea­son­ing.

• Writ­ten slightly differ­ently, the rea­son­ing seems sane: Sup­pose I cross. I must have proven it’s a good idea. Aka I proved that I’m con­sis­tent. Aka I’m in­con­sis­tent. Aka the bridge blows up. Bet­ter not cross.

• I agree with your English char­ac­ter­i­za­tion, and I also agree that it isn’t re­ally ob­vi­ous that the rea­son­ing is patholog­i­cal. How­ever, I don’t think it is so ob­vi­ously sane, ei­ther.

• It seems like coun­ter­fac­tual rea­son­ing about al­ter­na­tive ac­tions should avoid go­ing through “I’m ob­vi­ously in­sane” in al­most ev­ery case; pos­si­bly in ev­ery case. If you think about what would hap­pen if you made a par­tic­u­lar chess move, you need to di­vorce the con­se­quences from any “I’m ob­vi­ously in­sane in that sce­nario, so the rest of my moves in the game will be ter­rible” type rea­son­ing. You CAN’T as­sess that mak­ing a move would be in­sane UNTIL you rea­son out the con­se­quences w/​o any pre­sump­tion of in­san­ity; oth­er­wise, you might end up avoid­ing a move only be­cause it looks in­sane (and it looks in­sane only be­cause you avoid it, so you think you’ve gone mad if you take it). This prin­ci­ple seems po­ten­tially strong enough that you’d want to ap­ply it to the Troll Bridge case as well, even though in Troll Bridge it won’t ac­tu­ally help us make the right de­ci­sion (it just sug­gests that ex­pect­ing the bridge to blow up isn’t a le­git coun­ter­fac­tual).

• Also, coun­ter­fac­tu­als which pre­dict that the bridge blows up seem to be say­ing that the agent can con­trol whether PA is con­sis­tent or in­con­sis­tent. That might be con­sid­ered un­re­al­is­tic.

• Troll Bridge is a rare case where agents that re­quire proof to take ac­tion can prove they would be in­sane to take some ac­tion be­fore they’ve thought through its con­se­quences. Can you show how they could un­wisely do this in chess, or some sort of Troll Chess?

I don’t see how this agent seems to con­trol his san­ity. Does the agent who jumps off a roof iff he can (falsely) prove it wise choose whether he’s in­sane by choos­ing whether he jumps?

• I don’t see how this agent seems to con­trol his san­ity.

The agent in Troll Bridge thinks that it can make it­self in­sane by cross­ing the bridge. (Maybe this doesn’t an­swer your ques­tion?)

Troll Bridge is a rare case where agents that re­quire proof to take ac­tion can prove they would be in­sane to take some ac­tion be­fore they’ve thought through its con­se­quences. Can you show how they could un­wisely do this in chess, or some sort of Troll Chess?

I make no claim that this sort of case is com­mon. Sce­nar­ios where it comes up and is rele­vant to X-risk might in­volve alien su­per­in­tel­li­gences trol­ling hu­man-made AGI. But it isn’t ex­actly high on my list of con­cerns. The ques­tion is more about whether par­tic­u­lar the­o­ries of coun­ter­fac­tual are right. Troll Bridge might be “too hard” in some sense—we may just have to give up on it. But, gen­er­ally, these weird philo­soph­i­cal coun­terex­am­ples are more about point­ing out prob­lems. Com­plex real-life situ­a­tions are difficult to deal with (in terms of rea­son­ing about what a par­tic­u­lar the­ory of coun­ter­fac­tu­als will ac­tu­ally do), so we check sim­ple ex­am­ples, even if they’re out­landish, to get a bet­ter idea of what the coun­ter­fac­tu­als are do­ing in gen­eral.

• (Maybe this doesn’t an­swer your ques­tion?)

Cor­rect. I am try­ing to pin down ex­actly what you mean by an agent con­trol­ling a log­i­cal state­ment. To that end, I ask whether an agent that takes an ac­tion iff a state­ment is true con­trols the state­ment through choos­ing whether to take the ac­tion. (“The Killing Curse doesn’t crack your soul. It just takes a cracked soul to cast.”)

Per­haps we could equip logic with a “cau­sa­tion” pre­order such that all tau­tolo­gies are equiv­a­lent, cau­sa­tion im­plies im­pli­ca­tion, and when­ever we define an agent, we equip its con­trol cir­cuits with cau­sa­tion. Then we could say that A doesn’t cross the bridge be­cause it’s not in­sane. (I per­haps con­tentiously as­sume that in­san­ity and prov­ing san­ity are causally equiv­a­lent.)

If we re­ally wanted to, we could in­ves­ti­gate the agent that only ac­cepts util­ity proofs that don’t go causally back­wards. (Or rather, it re­quires that its ac­tion prov­ably causes the util­ity.)

You claimed this rea­son­ing is un­wise in chess. Can you give a sim­ple ex­am­ple illus­trat­ing this?

• Cor­rect. I am try­ing to pin down ex­actly what you mean by an agent con­trol­ling a log­i­cal state­ment. To that end, I ask whether an agent that takes an ac­tion iff a state­ment is true con­trols the state­ment through choos­ing whether to take the ac­tion. (“The Killing Curse doesn’t crack your soul. It just takes a cracked soul to cast.”)

The point here is that the agent de­scribed is act­ing like EDT is sup­posed to—it is check­ing whether its ac­tion im­plies X. If yes, it is act­ing as if it con­trols X in the sense that it is de­cid­ing which ac­tion to take us­ing those im­pli­ca­tions. I’m not ar­gu­ing at all that we should think “im­plies X” is causal, nor even that the agent has opinions on the mat­ter; only that the agent seems to be do­ing some­thing wrong, and one way of an­a­lyz­ing what it is do­ing wrong is to take a CDT stance and say “the agent is be­hav­ing as if it con­trols X”—in the same way that CDT says to EDT “you are be­hav­ing as if cor­re­la­tion im­plies cau­sa­tion” even though EDT would not as­sent to this in­ter­pre­ta­tion of its de­ci­sion.

If we re­ally wanted to, we could in­ves­ti­gate the agent that only ac­cepts util­ity proofs that don’t go causally back­wards. (Or rather, it re­quires that its ac­tion prov­ably causes the util­ity.)
You claimed this rea­son­ing is un­wise in chess. Can you give a sim­ple ex­am­ple illus­trat­ing this?

I think you have me the wrong way around; I was suggesting that certain causally-backwards reasoning would be unwise in chess, not the reverse. In particular, I was suggesting that we should not judge a move poor because we think the move is something only a poor player would do, but always the other way around. For example, suppose we have a prior on moves which suggests that moving a queen into danger is something only a poor player would do. Further suppose we are in a position to move our queen into danger in a way which forces checkmate in 4 moves. I’m saying that if we reason, “I could move my queen into danger to open up a path to checkmate in 4. However, only poor players move their queen into danger. Poor players would not successfully navigate a checkmate-in-4. Therefore, if I move my queen into danger, I expect to make a mistake costing me the checkmate in 4. Therefore, I will not move my queen into danger,” then that’s an example of the mistake I was pointing at.

Note: I do not per­son­ally en­dorse this as an ar­gu­ment for CDT! I am ex­press­ing these ar­gu­ments be­cause it is part of the sig­nifi­cance of Troll Bridge. I think these ar­gu­ments are the kinds of things one should grap­ple with if one is grap­pling with Troll Bridge. I have defended EDT from these kinds of cri­tiques ex­ten­sively el­se­where. My defenses do not work against Troll Bridge, but they do work against the chess ex­am­ple. But I’m not go­ing into those defenses here be­cause it would dis­tract from the points rele­vant to Troll Bridge.

• If I’m a poor enough player that I merely have ev­i­dence, not proof, that the queen move mates in four, then the heuris­tic that queen sac­ri­fices usu­ally don’t work out is fine and I might use it in real life. If I can prove that queen sac­ri­fices don’t work out, the rea­son­ing is fine even for a proof-re­quiring agent. Can you give a chesslike game where some proof-re­quiring agent can prove from the rules and per­haps the player source codes that queen sac­ri­fices don’t work out, and there­fore scores worse than some other agent would have? (Per­haps through mechanisms as in Troll bridge.)

• The heuristic can override mere evidence, agreed. The problem I’m pointing at isn’t that the heuristic is fundamentally bad and shouldn’t be used, but rather that it shouldn’t circularly reinforce its own conclusion by counting a hypothesized move as differentially suggesting you’re a bad player in the hypothetical where you make that move. Thinking that way seems contrary to the spirit of the hypothetical (whose purpose is to help evaluate the move). It’s fine for the heuristic to suggest things are bad in that hypothetical (because you heuristically think the move is bad); it seems much more questionable to suppose that your subsequent moves will be worse in that hypothetical, particularly if that inference is a lynchpin of your overall negative assessment of the move.

What do you want out of the chess-like example? Is it enough for me to say the troll could be the other player, and the bridge could be a strategy which you want to employ? (The other player defeats the strategy if they think you did it for a dumb reason, and they let it work if they think you did it smartly; they know you well, but you don’t know whether they think you’re dumb, though you do know that if you were being dumb then you would use the strategy.) This can be exactly Troll Bridge as stated in the post, but set in chess with player source code visible.

I’m guess­ing that’s not what you want, but I’m not sure what you want.

• I started ask­ing for a chess ex­am­ple be­cause you im­plied that the rea­son­ing in the top-level com­ment stops be­ing sane in iter­ated games.

In a simple iteration of Troll bridge, whether we’re dumb is clear after the first time we cross the bridge. In a simple variation, the troll requires smartness even given past observations. In either case, the best worst-case utility bound requires never crossing the bridge, and A knows crossing blows A up. You seemed to expect more.

Sup­pose my chess skill varies by day. If my last few moves were dumb, I shouldn’t rely on my skill to­day. I don’t see why I shouldn’t de­duce this ahead of time and, un­til I know I’m smart to­day, be ex­tra care­ful around moves that to dumb play­ers look ex­tra good and are ex­tra bad.

More con­cretely: Sup­pose that an un­known weight­ing of three sub­rou­tines ap­proval-votes on my move: Timmy likes mov­ing big pieces, Johnny likes play­ing good chess, and Spike tries to win in this meta. Sup­pose we start with move A, B or C available. A and B lead to a Johnny gam­bit that Timmy would ruin. Johnny thinks “If I play alone, A and B lead to 80% win prob­a­bil­ity and C to 75%. I ap­prove ex­actly A and B.”. Timmy gives 0, 0.2 and 1 of his max­i­mum vote to A, B and C. Spike wants the gam­bit to hap­pen iff Spike and Johnny can out­vote Timmy. Spike wants to vote for A and against B. How hard Spike votes for C trades off be­tween his test’s false pos­i­tive and false nega­tive rates. If B wins, ruin is likely. Spike’s rea­son­ing seems to re­quire those hy­po­thet­i­cal skill up­dates you don’t like.

• I started ask­ing for a chess ex­am­ple be­cause you im­plied that the rea­son­ing in the top-level com­ment stops be­ing sane in iter­ated games.
In a sim­ple iter­a­tion of Troll bridge, whether we’re dumb is clear af­ter the first time we cross the bridge.

Right, OK. I would say “se­quen­tial” rather than “iter­ated”—my point was about mak­ing a weird as­sess­ment of your own fu­ture be­hav­ior, not what you can do if you face the same sce­nario re­peat­edly. IE: Troll Bridge might be seen as ar­tifi­cial in that the en­vi­ron­ment is ex­plic­itly de­signed to pun­ish you if you’re “dumb”; but, per­haps a se­quen­tial game can pun­ish you more nat­u­rally by virtue of poor fu­ture choices.

Sup­pose my chess skill varies by day. If my last few moves were dumb, I shouldn’t rely on my skill to­day. I don’t see why I shouldn’t de­duce this ahead of time

Yep, I agree with this.

I con­cede the fol­low­ing points:

• If there is a mis­take in the troll-bridge rea­son­ing, pre­dict­ing that your next ac­tions are likely to be dumb con­di­tional on a dumb-look­ing ac­tion is not an ex­am­ple of the mis­take.

• Fur­ther­more, that in­fer­ence makes perfect sense, and if it is as analo­gous to the troll-bridge rea­son­ing as I was pre­vi­ously sug­gest­ing, the troll-bridge rea­son­ing makes sense.

How­ever, I still as­sert the fol­low­ing:

• Pre­dict­ing that your next ac­tions are likely to be dumb con­di­tional on a dumb look­ing ac­tion doesn’t make sense if the very rea­son why you think the ac­tion looks dumb is that the next ac­tions are prob­a­bly dumb if you take it.

IE, you don’t have a prior heuris­tic judge­ment that a move is one which you make when you’re dumb; rather, you’ve cir­cu­larly con­cluded that the move would be dumb—be­cause it’s likely to lead to a bad out­come—be­cause if you take that move your sub­se­quent moves are likely to be bad—be­cause it is a dumb move.

I don’t have a nat­u­ral setup which would lead to this, but the point is that it’s a crazy way to rea­son rather than a nat­u­ral one.

The question, then, is whether the troll-bridge reasoning is analogous to this.

I think we should prob­a­bly fo­cus on the prob­a­bil­is­tic case (re­cently added to the OP), rather than the proof-based agent. I could see my­self de­cid­ing that the proof-based agent is more analo­gous to the sane case than the crazy one. But the prob­a­bil­is­tic case seems com­pletely wrong.

In the proof-based case, the question is: do we see the Löbian proof as “circular” in a bad way? It makes sense to conclude that you’d only cross the bridge when it is bad to do so, if you can see that proving it’s a good idea is inconsistent. But does the proof that that’s inconsistent “go through” that very inference? We know that the troll blows up the bridge if we’re dumb, but that in itself doesn’t constitute an outside reason that crossing is dumb.

But I can see an ar­gu­ment that our “out­side rea­son” is that we can’t know that cross­ing is safe, and since we’re a proof-based agent, would never take the risk un­less we’re be­ing dumb.

How­ever, this rea­son­ing does not ap­ply to the prob­a­bil­is­tic agent. It can cross the bridge as a calcu­lated risk. So its rea­son­ing seems ab­solutely cir­cu­lar. There is no “prior rea­son” for it to think cross­ing is dumb; and, even if it did think it more likely dumb than not, it doesn’t seem like it should be 100% cer­tain of that. There should be some util­ities for the three out­comes which pre­serve the prefer­ence or­der­ing but which make the risk of cross­ing worth­while.

• I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent’s source code, if you see it start to cross the bridge, it is correct to infer that its reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfactual behavior, screening out other logical implications.

Also, coun­ter­fac­tu­als which pre­dict that the bridge blows up seem to be say­ing that the agent can con­trol whether PA is con­sis­tent or in­con­sis­tent.

Disagree that that’s what’s hap­pen­ing. The link be­tween the con­sis­tency of the rea­son­ing sys­tem and the be­hav­ior of the agent is be­cause the con­sis­tency of the rea­son­ing sys­tem con­trols the agent’s be­hav­ior, rather than the other way around. Since the agent is se­lect­ing out­comes based on their con­se­quences, it does make sense to speak of the agent choos­ing ac­tions to some ex­tent, but I think speak­ing of log­i­cal im­pli­ca­tions of the agent’s ac­tions on the con­sis­tency of for­mal sys­tems as “con­trol­ling” the con­sis­tency of the for­mal sys­tem seems like an in­ap­pro­pri­ate at­tri­bu­tion of agency to me.

• I agree with ev­ery­thing you say here, but I read you as think­ing you dis­agree with me.

I think the coun­ter­fac­tu­als used by the agent are the cor­rect coun­ter­fac­tu­als for some­one else to use while rea­son­ing about the agent from the out­side, but not the cor­rect coun­ter­fac­tu­als for the agent to use while de­cid­ing what to do.

Yeah, that’s the prob­lem I’m point­ing at, right?

Disagree that that’s what’s hap­pen­ing. The link be­tween the con­sis­tency of the rea­son­ing sys­tem and the be­hav­ior of the agent is be­cause the con­sis­tency of the rea­son­ing sys­tem con­trols the agent’s be­hav­ior, rather than the other way around. Since the agent is se­lect­ing out­comes based on their con­se­quences, it does make sense to speak of the agent choos­ing ac­tions to some ex­tent, but I think speak­ing of log­i­cal im­pli­ca­tions of the agent’s ac­tions on the con­sis­tency of for­mal sys­tems as “con­trol­ling” the con­sis­tency of the for­mal sys­tem seems like an in­ap­pro­pri­ate at­tri­bu­tion of agency to me.

I think we just agree on that? As I re­sponded to an­other com­ment here:

The point here is that the agent de­scribed is act­ing like EDT is sup­posed to—it is check­ing whether its ac­tion im­plies X. If yes, it is act­ing as if it con­trols X in the sense that it is de­cid­ing which ac­tion to take us­ing those im­pli­ca­tions. I’m not ar­gu­ing at all that we should think “im­plies X” is causal, nor even that the agent has opinions on the mat­ter; only that the agent seems to be do­ing some­thing wrong, and one way of an­a­lyz­ing what it is do­ing wrong is to take a CDT stance and say “the agent is be­hav­ing as if it con­trols X”—in the same way that CDT says to EDT “you are be­hav­ing as if cor­re­la­tion im­plies cau­sa­tion” even though EDT would not as­sent to this in­ter­pre­ta­tion of its de­ci­sion.
• I’m having difficulty following the line of the proof beginning “so, either way, PA is inconsistent”. We have □(cross → U=+10) (or □(cross → U=0)) and □(cross → U=-10), which together imply that □(¬cross), but I’m not immediately seeing how this leads to □⊥?

• Ah, got there. From □(cross → U=+10) and □(cross → U=-10), we get specifically □(cross → ⊥) and thus □(¬cross). But we have cross → □(cross) directly as a theorem (axiom?) about the behaviour of □ on checkable claims like “the agent crosses”, and we can lift this into our assumption that the agent crosses, so also □(cross) and thus □⊥.

• Seems to me that if an agent with a rea­son­able heuris­tic for log­i­cal un­cer­tainty came upon this prob­lem, and was con­fi­dent but not cer­tain of its con­sis­tency, it would sim­ply cross be­cause ex­pected util­ity would be above zero, which is a rea­son that doesn’t be­tray an in­con­sis­tency. (Be­sides, if it sur­vived it would have good 3rd party val­i­da­tion of its own con­sis­tency, which would prob­a­bly be pretty use­ful.)

• I agree that “it seems that it should”. I’ll try and even­tu­ally edit the post to show why this is (at least) more difficult to achieve than it ap­pears. The short ver­sion is that a proof is still a proof for a log­i­cally un­cer­tain agent; so, if the Löbian proof did still work, then the agent would up­date to 100% be­liev­ing it, elimi­nat­ing its un­cer­tainty; there­fore, the proof still works (via its Löbian na­ture).

• The proof doesn’t work on a log­i­cally un­cer­tain agent. The logic fails here:

Ex­am­in­ing the source code of the agent, be­cause we’re as­sum­ing the agent crosses, ei­ther PA proved that cross­ing im­plies U=+10, or it proved that cross­ing im­plies U=0.

A log­i­cally un­cer­tain agent does not need a proof of ei­ther of those things in or­der to cross, it sim­ply needs a pos­i­tive ex­pec­ta­tion of util­ity, for ex­am­ple a heuris­tic which says that there’s a 99% chance cross­ing im­plies U=+10.

Though you did say there’s a ver­sion which still works for log­i­cal in­duc­tion. Do you have a link to where I can see that ver­sion of the ar­gu­ment?

Edit: Now I think I see the logic. On the assumption that the agent crosses but also proves that crossing implies U=-10, the agent must have a contradiction somewhere because of that, and the logical uncertainty agents I’m aware of have a contradiction upon proving U=-10 because they prove that they will not cross, and then immediately cross in a maneuver meant to prevent exactly this kind of problem.

Wait, but proving that crossing implies U=-10 does not mean they prove they will not cross, exactly because they might still cross, if they have a contradiction.

God this stuff is con­fus­ing. I still don’t think the logic holds though.

• I’ve now ed­ited the post to give the ver­sion which I claim works in the em­piri­cally un­cer­tain case, and give more hints for how it still goes through in the fully log­i­cally un­cer­tain case.

• Viewed from the out­side, in the log­i­cal coun­ter­fac­tual where the agent crosses, PA can prove its own con­sis­tency, and so is in­con­sis­tent. There is a model of PA in which “PA proves False”. Hav­ing coun­ter­fac­tualed away all the other mod­els, these are the ones left. Log­i­cal coun­ter­fac­tu­al­ing on any state­ment that can’t be proved or dis­proved by a the­ory should pro­duce the same re­sult as adding it as an ax­iom. Ie log­i­cal coun­ter­fac­tu­al­ing ZF on choice should pro­duce ZFC.

The only un­rea­son­able­ness here comes from the agent’s worst case op­ti­miz­ing be­havi­our. This agent is ex­ces­sively cau­tious. A log­i­cal in­duc­tion agent, with PA as a de­duc­tive pro­cess will as­sign some prob P strictly be­tween 0 and 1 to “PA is con­sis­tent”. Depend­ing on which ver­sion of log­i­cal in­duc­tion you run, and how much you want to cross the bridge, cross­ing might be worth it. (the troll is still blow­ing up the bridge iff PA proves False)

A log­i­cal coun­ter­fac­tual where you don’t cross the bridge is ba­si­cally a coun­ter­fac­tual world where your de­sign of log­i­cal in­duc­tion as­signs lower prob to “PA is con­sis­tent”. In this world it doesn’t cross and gets zero.

The al­ter­na­tive is a log­i­cal fac­tual where it ex­pects +ve util.

So if we make logical induction like crossing enough, and not mind getting blown up much, it crosses the bridge. Let’s reverse this: say an agent really doesn’t want to be blown up.

In the counterfactual world where it crosses, logical induction assigns more prob to “PA is consistent”. The expected utility procedure has to use its real probability distribution, not ask the counterfactual agent for its expected util.

I am not sure what happens after this; I think you still need to think about what you do in impossible worlds. Still working it out.

• I’ve now ed­ited the post to ad­dress un­cer­tainty more ex­ten­sively.

• I don’t to­tally dis­agree, but see my re­ply to Gurken­glas as well as my re­ply to An­drew Sauer. Uncer­tainty doesn’t re­ally save us, and the be­hav­ior isn’t re­ally due to the worst-case-min­i­miz­ing be­hav­ior. It can end up do­ing the same thing even if get­ting blown up is only slightly worse than not cross­ing! I’ll try to edit the post to add the ar­gu­ment wherein log­i­cal in­duc­tion fails even­tu­ally (maybe not for a week, though). I’m much more in­clined to say “Troll Bridge is too hard; we can’t de­mand so much of our coun­ter­fac­tu­als” than I am to say “the coun­ter­fac­tual is ac­tu­ally perfectly rea­son­able” or “the prob­lem won’t oc­cur if we have rea­son­able un­cer­tainty”.

• “(It makes sense that) A proof-based agent can’t cross a bridge whose safety is de­pen­dent on the agent’s own logic be­ing con­sis­tent, since proof-based agents can’t know whether their logic is con­sis­tent.”

If the agent crosses the bridge, then the agent knows it­self to be con­sis­tent.

The agent can­not know whether they are con­sis­tent.

There­fore, cross­ing the bridge im­plies an in­con­sis­tency (they know them­self to be con­sis­tent, even though that’s im­pos­si­ble.)

The coun­ter­fac­tual rea­son­ing seems quite rea­son­able to me.

• there is a troll who will blow up the bridge with you on it, if you cross it “for a dumb rea­son”

Does this way of writ­ing “if” mean the same thing as “iff”, i.e. “if and only if”?

• In­ter­est­ing.

I have two is­sues with the rea­son­ing as pre­sented; the sec­ond one is more im­por­tant.

First of all, I’m un­sure about “Rather, the point is that the agent’s “coun­ter­fac­tual” rea­son­ing looks crazy.” I think we don’t know the agent’s coun­ter­fac­tual rea­son­ing. We know, by Löb’s the­o­rem, that “there ex­ists a proof that (proof of L im­plies L)” im­plies “there ex­ists a proof of L”. It doesn’t tell us what struc­ture this proof of L has to take, right? Who knows what coun­ter­fac­tu­als are be­ing con­sid­ered to make that proof? (I may be mi­s­un­der­stand­ing this).

Se­cond of all, it seems that if we change the last line of the agent to [else, “cross”], the ar­gu­ment fails. Same if we in­sert [else if A()=”cross” ⊢ U=-10, then out­put “cross”; else if A()=”not cross” ⊢ U=-10, then out­put “not cross”] above the last line. In both cases, this is be­cause U=-10 is now pos­si­ble, given cross­ing. I’m sus­pi­cious when the ar­gu­ment seems to de­pend so much on the struc­ture of the agent.

To de­velop that a bit, it seems the agent’s al­gorithm as writ­ten im­plies “If I cross the bridge, I am con­sis­tent” (be­cause U=-10 is not an op­tion). If we mod­ify the al­gorithm as I just sug­gested, then that’s no longer the case; it can con­sider coun­ter­fac­tu­als where it crosses the bridge and is in­con­sis­tent (or, at least, of un­known con­sis­tency). So, given that, the agent’s coun­ter­fac­tual rea­son­ing no longer seems so crazy, even if it’s as claimed. That’s be­cause the agent’s rea­son­ing needs to de­duce some­thing from “If I cross the bridge, I am con­sis­tent” that it can’t de­duce with­out that. Given that state­ment, then be­ing Löbian or similar seems quite nat­u­ral, as those are some of the few ways of deal­ing with state­ments of that type.

• First of all, I’m un­sure about “Rather, the point is that the agent’s “coun­ter­fac­tual” rea­son­ing looks crazy.” I think we don’t know the agent’s coun­ter­fac­tual rea­son­ing. We know, by Löb’s the­o­rem, that “there ex­ists a proof that (proof of L im­plies L)” im­plies “there ex­ists a proof of L”. It doesn’t tell us what struc­ture this proof of L has to take, right? Who knows what coun­ter­fac­tu­als are be­ing con­sid­ered to make that proof? (I may be mi­s­un­der­stand­ing this).

The agent as de­scribed is us­ing prov­able con­se­quences of ac­tions to make de­ci­sions. So, it is us­ing prov­able con­se­quences as coun­ter­fac­tu­als. At least, that’s the sense in which I mean it—for­give my ter­minol­ogy if this doesn’t make sense to you. I could have said “the agent’s con­di­tional rea­son­ing” or “the agent’s con­se­quen­tial­ist rea­son­ing” etc.

Se­cond of all, it seems that if we change the last line of the agent to [else, “cross”], the ar­gu­ment fails. Same if we in­sert [else if A()=”cross” ⊢ U=-10, then out­put “cross”; else if A()=”not cross” ⊢ U=-10, then out­put “not cross”] above the last line. In both cases, this is be­cause U=-10 is now pos­si­ble, given cross­ing. I’m sus­pi­cious when the ar­gu­ment seems to de­pend so much on the struc­ture of the agent.

I think the benefit of the doubt does not go to the agent, here. If the agent’s rea­son­ing is sen­si­ble only un­der cer­tain set­tings of the de­fault ac­tion clause, then one would want a story about how to set the de­fault ac­tion clause ahead of time so that we can en­sure that the agent’s rea­son­ing is always sen­si­ble. But this seems im­pos­si­ble.

How­ever, I fur­ther­more think the ex­am­ple can be mod­ified to make the agent’s rea­son­ing look silly in any case.

If the last line of the agent reads “cross” rather than “not cross”, I think we can re­cover the ar­gu­ment by chang­ing what the troll is do­ing. The gen­eral pat­tern is sup­posed to be: the troll blows up the bridge if we cross “for a dumb rea­son”—where “dumb” is tar­geted at agents who do any­thing analo­gous to ep­silon ex­plo­ra­tion or the chicken rule.

So, we mod­ify the troll to blow up the bridge if PA is in­con­sis­tent OR if the agent reaches its “else” clause.

The agent can no longer be ac­cused of rea­son­ing as if it con­trol­led PA. How­ever, it still sees the bridge blow­ing up as a definite con­se­quence of its cross­ing the bridge. This still doesn’t seem sen­si­ble, be­cause its very rea­son for be­liev­ing this in­volves a cir­cu­lar sup­po­si­tion that it’s prov­able—much like my chess ex­am­ple where an agent con­cludes that an ac­tion is poor due to a prob­a­bil­is­tic be­lief that it would be a poor player if it took that sort of ac­tion.

[Note that I have not checked the proof for the pro­posed var­i­ant in de­tail, so, could be wrong.]

• If the agent’s rea­son­ing is sen­si­ble only un­der cer­tain set­tings of the de­fault ac­tion clause

That was my first rewrit­ing; the sec­ond is an ex­am­ple of a more gen­eral al­gorithm that would go some­thing like this. If we as­sume that both prob­a­bil­ities and util­ities are dis­crete, all of the form q/​n for some q, and bounded above and be­low by N, then some­thing like this (for EU the ex­pected util­ity, and Ac­tions the set of ac­tions, and b some de­fault ac­tion):

for q integer in N*n^2 to -N*n^2 (ordered from highest to lowest):
    for a in Actions:
        if A()=a ⊢ EU=q/n^2 then output a
else output b


Then the Löbian proof fails. The agent will fail to prove any of those “if” implications, until it proves “A()=”not cross” ⊢ EU=0”. Then it outputs “not cross”; the default action b is not relevant. Also not relevant, here, is the order in which a is sampled from “Actions”.

• I don’t see why the proof fails here; it seems to go es­sen­tially as usual.

Rea­son­ing in PA:

Sup­pose a=cross->u=-10 were prov­able.

Fur­ther sup­pose a=cross.

Note that we can see there’s a proof that not cross­ing gets 0, so it must be that a bet­ter (or equal) value was found for cross­ing, which must have been +10 un­less PA is in­con­sis­tent, since cross­ing im­plies that u is ei­ther +10 or −10. Since we already as­sumed cross­ing gets −10, this leads to trou­ble in the usual way, and the proof pro­ceeds from there.

(Ac­tu­ally, I guess ev­ery­thing is a lit­tle messier since you haven’t stipu­lated the search or­der for ac­tions, so we have to ex­am­ine more cases. Car­ry­ing out some more de­tailed rea­son­ing: So (un­der our as­sump­tions) we know PA must have proved (a=cross → u=10). But we already sup­posed that it proves (a=cross → u=-10). So PA must prove not(a=cross). But then it must prove prove(not(a=cross)), since PA has self-knowl­edge of proofs. Which means PA can’t prove that it’ll cross, if PA is to be con­sis­tent. But it knows that prov­ing it doesn’t take an ac­tion makes that ac­tion very ap­peal­ing; so it knows it would cross, un­less not cross­ing was equally ap­peal­ing. But it can prove that not cross­ing nets zero. So the only way for not cross­ing to be equally ap­peal­ing is for PA to also prove not cross­ing nets 10. For this to be con­sis­tent, PA has to prove that the agent doesn’t not-cross. But now we have PA prov­ing that the agent doesn’t cross and also that it doesn’t not cross! So PA must be in­con­sis­tent un­der our as­sump­tions. The rest of the proof con­tinues as usual.)

• You are en­tirely cor­rect; I don’t know why I was con­fused.

However, looking at the proof again, it seems there might be a potential hole. You use Löb’s theorem within an assumption sub-loop. This seems to assume that from “A ⊢ (□X → X)”, we can deduce “A ⊢ X”.

But this cannot be true in general! To see this, set A = (□X → X). Then A ⊢ (□X → X), trivially; if, from that, we could deduce A ⊢ X, we would have ⊢ (□X → X) → X for any X. But this statement, though it looks like Löb’s theorem, is one that we cannot deduce in general (see Eliezer’s “medium-hard problem” here).

Can this hole be patched?

(note that if ⊢_A (□_A X → X), where ⊢_A is a PA proof that adds A as an extra axiom (and □_A the corresponding provability), then we can deduce ⊢_A X).

• I’m find­ing some of the text in the comic slightly hard to read.

• Yep, sorry. The illus­tra­tions were not ac­tu­ally origi­nally meant for pub­li­ca­tion; they’re from my per­sonal notes. I did it this way (1) be­cause the pic­tures are kind of nice, (2) be­cause I was frus­trated that no one had writ­ten a good sum­mary post on Troll Bridge yet, (3) be­cause I was in a hurry. Ideally I’ll edit the images to be more suit­able for the post, al­though adding the omit­ted con­tent is a higher pri­or­ity.

• How can you (in gen­eral) con­clude some­thing by ex­am­in­ing the source code of an agent, with­out po­ten­tially im­pli­cat­ing the Halt­ing Prob­lem?