# Troll Bridge

*All of the results in this post, and most of the informal observations/interpretations, are due to Sam Eisenstat.*

Troll Bridge is a decision problem which has been floating around for a while, but which has lacked a good introductory post. The original post gives the essential example, but it lacks the “troll bridge” story, which (1) makes it hard to understand, since it is just stated in mathematical abstraction, and (2) makes it difficult to find if you search for “troll bridge”.

The basic idea is that you want to cross a bridge. However, there is a troll who will blow up the bridge with you on it, *if* (and only if) you cross it “for a dumb reason” — for example, due to unsound logic. You can get to where you want to go by a worse path (through the stream). This path is better than being blown up, though.

We apply a Löbian proof to show not only that you choose not to cross, but furthermore, that your counterfactual reasoning is confident that the bridge would have blown up if you had crossed. This is supposed to be a counterexample to various proposed notions of counterfactual, and a problem for various proposed decision theories.

The pseudocode for the environment (more specifically, the utility gained from the environment) is as follows:

IE, if the agent crosses the bridge and is inconsistent, then U=-10. (□⊥ means “PA proves an inconsistency”.) Otherwise, if the agent crosses the bridge, U=+10. If neither of these (IE, the agent does not cross the bridge), U=0.
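The rule just described can be sketched in Python. This is a minimal sketch, not the original pseudocode: the undecidable check “PA proves an inconsistency” (□⊥) is stubbed out as a boolean parameter, since an actual proof search is of course not computable in general.

```python
def troll_bridge_utility(action, pa_proves_inconsistency):
    """Utility of the Troll Bridge environment.

    `action` is "cross" or "not cross"; `pa_proves_inconsistency` is a
    stand-in for the undecidable condition "PA proves a contradiction".
    """
    if action == "cross" and pa_proves_inconsistency:
        return -10  # the troll blows up the bridge
    elif action == "cross":
        return 10   # safe crossing
    else:
        return 0    # take the worse path through the stream
```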

The pseudocode for the agent could be as follows:

This is a little more complicated, but the idea is supposed to be that you search for every “action implies utility” pair, and take the action for which you can prove the highest utility (with some tie-breaking procedure). Importantly, this is the kind of proof-based decision theory which eliminates spurious counterfactuals in 5-and-10 type problems. It isn’t that easy to trip up with Löbian proofs. *(Historical/terminological note: This decision theory was initially called MUDT, and is still sometimes referred to in that way. However, I now often call it proof-based decision theory, because it isn’t centrally a UDT. “Modal DT” (MDT) would be reasonable, but the modal operator involved is the “provability” operator, so “proof-based DT” seems more direct.)*
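A minimal Python sketch of this kind of agent, with the proof search stubbed out as a finite list of already-proven “action implies utility” pairs (an assumption for illustration; a real proof-based agent would enumerate PA proofs up to some bound). Tie-breaking here is simply by the order of `actions`.

```python
def proof_based_agent(proven_implications, actions=("cross", "not cross")):
    """Pick the action with the highest provable utility.

    `proven_implications` is a list of (action, utility) pairs standing in
    for theorems of the form "action implies U = utility" found by a
    bounded proof search.
    """
    best_action, best_utility = None, None
    for action in actions:
        for proven_action, utility in proven_implications:
            if proven_action == action and (best_utility is None or utility > best_utility):
                best_action, best_utility = action, utility
    return best_action
```

With the Löbian proof in hand, the agent's proof search finds "crossing implies U=-10" and "not crossing implies U=0", so the sketch returns `"not cross"`.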

Now, the proof:

Reasoning within PA (ie, the logic of the agent):

Suppose the agent crosses.

Further suppose that the agent proves that crossing implies U=-10.

Examining the source code of the agent: since we’re assuming the agent crosses, either PA proved that crossing implies U=+10, or it proved that crossing implies U=0.

So, either way, PA is inconsistent—by way of 0=-10 or +10=-10.

So the troll actually blows up the bridge, and really, U=-10.

Therefore (popping out of the second assumption), if the agent proves that crossing implies U=-10, then in fact crossing implies U=-10.

By Löb’s theorem, crossing really implies U=-10.

So (since we’re still under the assumption that the agent crosses), U=-10.

So (popping out of the assumption that the agent crosses), the agent crossing implies U=-10.
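The Löbian core of the prose proof above can be stated compactly. Writing □ for provability in PA and abbreviating P := “the agent crosses ⇒ U = −10”:

```latex
% P abbreviates (A = \mathrm{cross} \Rightarrow U = -10); \Box is provability in PA.
\begin{enumerate}
  \item The reasoning under the two suppositions, carried out inside PA,
        establishes $\vdash \Box P \Rightarrow P$.
  \item L\"ob's theorem: if $\vdash \Box P \Rightarrow P$, then $\vdash P$.
  \item Hence $\vdash P$: crossing really implies $U = -10$.
\end{enumerate}
```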

Since we proved all of this in PA, the agent proves it too, and (unless PA is actually inconsistent) proves no better utility for crossing. On the other hand, it will prove that not crossing gives it a safe U=0. So it will in fact not cross.

The paradoxical aspect of this example is not that the agent doesn’t cross—it makes sense that a proof-based agent can’t cross a bridge whose safety is dependent on the agent’s own logic being consistent, since proof-based agents can’t know whether their logic is consistent. Rather, the point is that the agent’s “counterfactual” reasoning looks crazy. Arguably, the agent should be uncertain of what happens if it crosses the bridge, rather than certain that the bridge would blow up. Furthermore, the agent is reasoning as if it can control whether PA is consistent, which is arguably wrong.

## Analogy to Smoking Lesion

One interpretation of this thought-experiment is that it shows proof-based decision theory to be essentially a version of EDT, in that it has EDT-like behavior for Smoking Lesion. The analogy to Smoking Lesion is relatively strong:

An agent is at risk of having a significant internal issue. (In Smoking Lesion, it’s a medical issue. In Troll Bridge, it is logical inconsistency.)

The internal issue would bias the agent toward a particular action. (In Smoking Lesion, the agent smokes. In Troll Bridge, an inconsistent agent crosses the bridge.)

The internal issue also causes some imagined practical problem for the agent. (In Smoking Lesion, the lesion makes one more likely to get cancer. In Troll Bridge, the inconsistency would make the troll blow up the bridge.)

There is a chain of reasoning which combines these facts to stop the agent from taking the action. (In Smoking Lesion, EDT refuses to smoke due to the correlation with cancer. In Troll Bridge, the proof-based agent refuses to cross the bridge because of a Löbian proof that crossing the bridge leads to disaster.)

We intuitively find the conclusion nonsensical. (It seems the EDT agent should smoke; it seems the proof-based agent should not expect the bridge to explode.)

Indeed, the analogy to smoking lesion seems to strengthen the final point—that the counterfactual reasoning is wrong.

I plan to revise this post at some point, because I’m leaving some stuff out. Planned addendums:

It might seem like the agent doesn’t cross the bridge because of the severe risk aversion of proof-based DT. However, there is a version of the argument which works for agents with probabilistic reasoning, illustrating that the problem is not solved by assigning a low probability to the inconsistency of PA. Agents will not cross the bridge even if the cost of the explosion is extremely low, just slightly worse than not crossing. This makes the problem look much more severe, because it intuitively seems as if putting low probability on inconsistency should not only block the “weird” counterfactual, but actually allow an agent to cross the bridge.
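To see why that intuition is compelling, here is the naive expected-utility calculation a probabilistic agent might seem able to make; the specific credence and the mild blow-up penalty are illustrative assumptions, not part of the original setup.

```python
# Naive expected utility of crossing, for an agent that puts only a small
# probability on PA being inconsistent, and where getting blown up is only
# slightly worse than not crossing.
p_inconsistent = 0.001   # illustrative credence in "PA proves a contradiction"
u_safe_cross = 10        # crossing when PA is consistent
u_blown_up = -0.1        # explosion only slightly worse than staying
u_stay = 0               # take the path through the stream

eu_cross = (1 - p_inconsistent) * u_safe_cross + p_inconsistent * u_blown_up
# eu_cross is about 9.99, far above u_stay, so the naive calculation says cross.
```

The point of the argument, though, is that this calculation does not survive the Löbian proof: a proof is still a proof for a logically uncertain agent, so the agent ends up not crossing anyway.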

Furthermore, there’s a version which works for logical induction. This shows that the problem is not sensitive to the details of how one handles logical uncertainty, and indeed, is still present for the best theory of logical uncertainty we currently have.

Written slightly differently, the reasoning seems sane: Suppose I cross. I must have proven it’s a good idea. Aka I proved that I’m consistent. Aka I’m inconsistent. Aka the bridge blows up. Better not cross.

I agree with your English characterization, and I also agree that it isn’t really obvious that the reasoning is pathological. However, I don’t think it is so obviously sane, either.

It seems like counterfactual reasoning about alternative actions should avoid going through “I’m obviously insane” in almost every case; possibly in every case. If you think about what would happen if you made a particular chess move, you need to divorce the consequences from any “I’m obviously insane in that scenario, so the rest of my moves in the game will be terrible” type reasoning. You CAN’T assess that making a move would be insane UNTIL you reason out the consequences w/o any presumption of insanity; otherwise, you might end up avoiding a move only because it looks insane (and it looks insane only because you avoid it, so you think you’ve gone mad if you take it). This principle seems potentially strong enough that you’d want to apply it to the Troll Bridge case as well, even though in Troll Bridge it won’t actually help us make the right decision (it just suggests that expecting the bridge to blow up isn’t a legit counterfactual).

Also, counterfactuals which predict that the bridge blows up seem to be saying that the agent can control whether PA is consistent or inconsistent. That might be considered unrealistic.

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they’ve thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I don’t see how this agent seems to control his sanity. Does the agent who jumps off a roof iff he can (falsely) prove it wise choose whether he’s insane by choosing whether he jumps?

The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn’t answer your question?)

I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk might involve alien superintelligences trolling human-made AGI. But it isn’t exactly high on my list of concerns. The question is more about whether particular theories of counterfactual are right. Troll Bridge might be “too hard” in some sense—we may just have to give up on it. But, generally, these weird philosophical counterexamples are more about pointing out problems. Complex real-life situations are difficult to deal with (in terms of reasoning about what a particular theory of counterfactuals will actually do), so we check simple examples, even if they’re outlandish, to get a better idea of what the counterfactuals are doing in general.

Correct. I am trying to pin down exactly what you mean by an agent controlling a logical statement. To that end, I ask whether an agent that takes an action iff a statement is true controls the statement through choosing whether to take the action. (“The Killing Curse doesn’t crack your soul. It just takes a cracked soul to cast.”)

Perhaps we could equip logic with a “causation” preorder such that all tautologies are equivalent, causation implies implication, and whenever we define an agent, we equip its control circuits with causation. Then we could say that A doesn’t cross the bridge because it’s not insane. (I perhaps contentiously assume that insanity and proving sanity are causally equivalent.)

If we really wanted to, we could investigate the agent that only accepts utility proofs that don’t go causally backwards. (Or rather, it requires that its action provably causes the utility.)

You claimed this reasoning is unwise in chess. Can you give a simple example illustrating this?

I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent’s source code, if you see it start to cross the bridge, it is correct to infer that its reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about the purely causal effects of its counterfactual behavior, screening out other logical implications.

Disagree that that’s what’s happening. The link between the consistency of the reasoning system and the behavior of the agent is there because the consistency of the reasoning system controls the agent’s behavior, rather than the other way around. Since the agent is selecting actions based on their consequences, it does make sense to speak of the agent choosing actions to some extent, but speaking of the logical implications of the agent’s actions for the consistency of formal systems as “controlling” that consistency seems like an inappropriate attribution of agency to me.

I’m having difficulty following the line of the proof beginning “so, either way, PA is inconsistent”. We have □(A=cross⇒U=−10) and □(A=cross⇒U≠−10), which together imply that □(A≠cross), but I’m not immediately seeing how this leads to □⊥?

Ah, got there. From □(A≠cross), we get specifically □(A=cross⇒U=10) and thus □□(A=cross⇒U=10). But we have □(A=cross⇒U=10)⇒A=cross directly as a theorem (axiom?) about the behaviour of A, and we can lift this to □□(A=cross⇒U=10)⇒□(A=cross), so □(A=cross) also and thus □⊥.
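Spelled out in the same notation, the steps just described are:

```latex
% From the two boxed statements down to \Box\bot.
\begin{enumerate}
  \item $\Box(A = \mathrm{cross} \Rightarrow U = -10)$ and
        $\Box(A = \mathrm{cross} \Rightarrow U \neq -10)$ together give
        $\Box(A \neq \mathrm{cross})$.
  \item From $\Box(A \neq \mathrm{cross})$ we get, vacuously,
        $\Box(A = \mathrm{cross} \Rightarrow U = 10)$, and hence
        $\Box\Box(A = \mathrm{cross} \Rightarrow U = 10)$.
  \item The agent's source code gives
        $\vdash \Box(A = \mathrm{cross} \Rightarrow U = 10) \Rightarrow A = \mathrm{cross}$;
        boxing this yields
        $\Box\Box(A = \mathrm{cross} \Rightarrow U = 10) \Rightarrow \Box(A = \mathrm{cross})$.
  \item So $\Box(A = \mathrm{cross})$, which together with
        $\Box(A \neq \mathrm{cross})$ gives $\Box\bot$.
\end{enumerate}
```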

Seems to me that if an agent with a reasonable heuristic for logical uncertainty came upon this problem, and was confident but not certain of its consistency, it would simply cross because expected utility would be above zero, which is a reason that doesn’t betray an inconsistency. (Besides, if it survived it would have good 3rd party validation of its own consistency, which would probably be pretty useful.)

I agree that “it seems that it should”. I’ll try and eventually edit the post to show why this is (at least) more difficult to achieve than it appears. The short version is that a proof is still a proof for a logically uncertain agent; so, if the Löbian proof did still work, then the agent would update to 100% believing it, eliminating its uncertainty; therefore, the proof still works (via its Löbian nature).

The proof doesn’t work on a logically uncertain agent. The logic fails here:

A logically uncertain agent does not need a proof of either of those things in order to cross, it simply needs a positive expectation of utility, for example a heuristic which says that there’s a 99% chance crossing implies U=+10.

Though you did say there’s a version which still works for logical induction. Do you have a link to where I can see that version of the argument?

Edit: Now I do see the logic. On the assumption that the agent crosses but also proves that crossing implies U=-10, the agent must have a contradiction somewhere, because the logical uncertainty agents I’m aware of have a contradiction upon proving crossing implies U=-10: they prove that they will not cross, and then immediately cross, in a maneuver meant to prevent exactly this kind of problem.

Wait, but proving that crossing implies U=-10 does not mean they prove they will not cross, exactly because they might still cross if they have a contradiction.

God this stuff is confusing. I still don’t think the logic holds though.

Viewed from the outside, in the logical counterfactual where the agent crosses, PA can prove its own consistency, and so is inconsistent. There is a model of PA in which “PA proves False”. Having counterfactualed away all the other models, these are the ones left. Logical counterfactualing on any statement that can’t be proved or disproved by a theory should produce the same result as adding it as an axiom. Ie logical counterfactualing ZF on choice should produce ZFC.

The only unreasonableness here comes from the agent’s worst-case-optimizing behaviour. This agent is excessively cautious. A logical induction agent, with PA as its deductive process, will assign some probability P strictly between 0 and 1 to “PA is consistent”. Depending on which version of logical induction you run, and how much you want to cross the bridge, crossing might be worth it. (The troll is still blowing up the bridge iff PA proves False.)

A logical counterfactual where you don’t cross the bridge is basically a counterfactual world where your design of logical induction assigns lower prob to “PA is consistent”. In this world it doesn’t cross and gets zero.

The alternative is a logical factual where it expects +ve util.

So if we make the logical induction agent like crossing enough, and not mind getting blown up much, it crosses the bridge. Let’s reverse this: suppose an agent really doesn’t want to be blown up.

In the counterfactual world where it crosses, logical induction assigns more probability to “PA is consistent”. The expected utility procedure has to use its real probability distribution, not ask the counterfactual agent for its expected utility.

I am not sure what happens after this, I think you still need to think about what you do in impossible worlds. Still working it out.

I don’t totally disagree, but see my reply to Gurkenglas as well as my reply to Andrew Sauer. Uncertainty doesn’t really save us, and the behavior isn’t really due to the worst-case-minimizing behavior. It can end up doing the same thing even if getting blown up is only slightly worse than not crossing! I’ll try to edit the post to add the argument wherein logical induction fails eventually (maybe not for a week, though). I’m much more inclined to say “Troll Bridge is too hard; we can’t demand so much of our counterfactuals” than I am to say “the counterfactual is actually perfectly reasonable” or “the problem won’t occur if we have reasonable uncertainty”.

“(It makes sense that) A proof-based agent can’t cross a bridge whose safety is dependent on the agent’s own logic being consistent, since proof-based agents can’t know whether their logic is consistent.”

If the agent crosses the bridge, then the agent knows itself to be consistent.

The agent cannot know whether they are consistent.

Therefore, crossing the bridge implies an inconsistency (they know themselves to be consistent, even though that’s impossible).

The counterfactual reasoning seems quite reasonable to me.

See my reply to Gurkenglas.

Does this way of writing “if” mean the same thing as “iff”, i.e. “if and only if”?

No, but I probably should have said “iff” or “if and only if”. I’ll edit.

I’m finding some of the text in the comic slightly hard to read.

Yep, sorry. The illustrations were not actually originally meant for publication; they’re from my personal notes. I did it this way (1) because the pictures are kind of nice, (2) because I was frustrated that no one had written a good summary post on Troll Bridge yet, (3) because I was in a hurry. Ideally I’ll edit the images to be more suitable for the post, although adding the omitted content is a higher priority.

How can you (in general) conclude something by examining the source code of an agent, without potentially implicating the Halting Problem?

Nothing stops the halting problem from being solved in particular instances. I can prove that some agent halts, and so can it. See FairBot in Robust Cooperation in the Prisoner’s Dilemma.

In this case, we have (by assumption) an output of the program, so we just look at the cases where the program gives that output.