# scasper (Stephen Casper)

In both cases, it can block the Löbian proofs. But there is something unsatisfying about making ad-hoc adjustments to one’s policy like this. I’ll quote Demski on this instead of trying to write my own explanation. Demski writes:

Secondly, an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof. However, the approach seems worryingly unprincipled, because we can “improve” the epistemics by tightening the relationship to logic, and get a decision-theoretically much worse result.

The problem here is that we have some epistemic principles which suggest tightening up is good (it’s free money; the looser relationship doesn’t lose much, but it’s a dead-weight loss), and no epistemic principles pointing the other way. So it feels like an unprincipled exception: “being less dutch-bookable is generally better, but hang loose in this one case, would you?”

Naturally, this approach is still very interesting, and could be pursued further—especially if we could give a more principled reason to keep the observance of logic loose in this particular case. But this isn’t the direction this document will propose. (Although you *could* think of the proposals here as giving more principled reasons to let the relationship with logic be loose, sort of.)

So here, we will be interested in solutions which “solve troll bridge” in the stronger sense of getting it right while fully respecting logic. IE, updating to probability 1 (/0) when something is proven (/refuted).

Any chance you could clarify?

In the troll bridge problem, the counterfactual (the agent crossing the bridge) would indicate the inconsistency of the agent’s logical system of reasoning. See this post and what Demski calls a subjective theory of counterfactuals.

in your terms an “object” view and an “agent” view.

Yes, I think that there is a time and place for these two stances toward agents. The object stance is for when we are thinking about how behavior is deterministic conditioned on a state of the world and agent. The agent stance is for when we are trying to be purposive and think about what types of agents to be/design. If we never wanted to take the object stance, we couldn’t successfully understand many dilemmas, and if we never wanted to take the agent stance, then there seems little point in trying to talk about what any agent ever “should” do.

There’s a sense in which this is self-defeating b/c if CDT implies that you should pre-commit to FDT, then why do you care what CDT recommends as it appears to have undermined itself?

I don’t especially care.

counterfactuals only make sense from within themselves

Is naive thinking about the troll bridge problem a counterexample to this? There, the counterfactual stems from a contradiction.

CDT doesn’t recommend itself, but FDT does, so this process leads us to replace our initial starting assumption of CDT with FDT.

I think that no general type of decision theory worth two cents always recommends itself. Any decision theory X that isn’t silly would recommend replacing itself before entering a mind-policing environment in which the mind police punish an agent iff they use X.

Thanks. I rewrote the second bit you quoted. I agree that sketching the proof that way was not good.

Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.

This should be clearer and not imply that Rob needs to be able to prove his own consistency. I hope that helps.
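In symbols, the sketch above has the shape of an application of Löb’s theorem. The notation here is an assumption for illustration, not something fixed in the post: $T$ stands for Rob’s logical system, $\mathrm{Prov}$ for a provability predicate for $T$, $C$ for “Rob crosses”, and $B$ for “the bridge blows up”. The second line is meant only for this specific $P$, not as a general schema.

```latex
\begin{align*}
P \;&:=\; (C \rightarrow B)
  && \text{``if Rob crosses, the bridge blows up''} \\
T \;&\vdash\; \mathrm{Prov}(P) \rightarrow P
  && \text{the hypothetical step in the sketch} \\
T \;&\vdash\; P
  && \text{by L\"ob's theorem}
\end{align*}
```

Löb’s theorem is what licenses the move from the second line to the third: whenever $T$ proves $\mathrm{Prov}(P) \rightarrow P$, it also proves $P$.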

Here’s the new version of the paragraph with my mistaken explanation fixed.

“Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.”

Thanks for the comment. tl;dr, I think you mixed up some things I said, and interpreted others in a different way than I intended. But either way, I don’t think there are “enormous problems”.

So the statement to be proven (which I shall call P) is not just “agent takes action X”, but “when presented with this specific proof of P, the agent takes action X”.

Remember that I intentionally give a simplified sketch of the proof instead of providing it. If I did, I would specify the provability predicate. I think you’re conflating what I say about the proof and what I say about the agent. Here, I say that our model agent who is vulnerable to spurious proofs would obey a proof that it would take X if presented. Demski explains things the same way. I don’t say that’s the definition of the provability predicate here. In this case, an agent being willing to accede to proofs in general that it will take X is indeed sufficient for being vulnerable to spurious proofs.

Second is that in order for Löb’s theorem to have any useful meaning in this context, the agent must be consistent and *able to prove its own consistency*, which it cannot do by Gödel’s second incompleteness theorem.

I don’t know where you’re getting this from. It would be helpful if you mentioned where. I definitely don’t say anywhere that Rob must prove his own consistency, and neither of the two types of proofs I sketch out assumes this either. You might be focusing on a bit that I wrote: “So assuming the consistency of his logical system...” I’ll edit this explanation for clarity. I don’t intend that Rob be able to prove the consistency, but that if he proved crossing would make it blow up, that would imply crossing would make it blow up.

As presented, it is given a statement P (which could be anything), and asked to verify that “Prov(P) → P” for use in Löb’s theorem. While the post claims that this is obvious, it is absolutely *not*.

I don’t know where you’re getting this from either. In the “This is not trivial...” paragraph I explicitly talk about the difference between statements, proofs, and provability predicates. I think you have some confusion about what I’m saying either due to skimming or to how I have the word “hypothetically” do a lot of work in my explanation of this (arguably too much). But I definitely do not claim that “Prov(P) → P”.

No.

The hypothetical pudding matters too!

My best answer here is in the form of this paper that I wrote which talks about these dilemmas and a number of others. Decision theoretic flaws like the ones here are examples of subtle flaws in seemingly-reasonable frameworks for making decisions that may lead to unexpected failures in niche situations. For agents who are either vulnerable to spurious proofs or trolls, there are adversarial situations that could effectively exploit these weaknesses. These issues aren’t tied to incompleteness so much as they are just examples of ways that agents could be manipulable.

The importance of this question doesn’t involve whether or not there is an “option” in the first case or what you can or can’t do in the second. What matters is whether, **hypothetically**, you would always obey such a proof or would potentially disobey one. The hypothetical here matters independently of what actually happens because the hypothetical commitments of an agent can potentially be used in cases like this to prove things about it via Löb’s theorem.

Another type of case in which how an agent would hypothetically behave can have real influence on its actual circumstances is Newcomb’s problem. See this post.

I’ll use this from now on.

Feel free to skim through the users I follow at https://twitter.com/StephenLCasper

# Pitfalls with Proofs

# A daily routine I do for my AI safety research work

Thank you. I fixed it. Here’s the raw link too. https://arxiv.org/abs/2010.05418

This is self promotion, but this paper provides one type of answer for how certain questions involving agent foundations are directly important for alignment.

# Deep Dives: My Advice for Pursuing Work in Research

Thanks for this post, but I don’t feel like I have the background for understanding the point you’re making. In the pingback post, Demski describes your point as saying:

an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof.

Could you offer a high-level explanation of what the main principle here is and what Demski means by looseness? (If such an explanation exists.) Thanks.

Thanks for this post, I think it has high clarificational value and that your interpretation is valid and good. In my post, I failed to cite Y&S’s actual definition and should have been more careful. I ended up critiquing a definition that probably resembled MacAskill’s definition more than Y&S’s, and it seems to have been somewhat of an accidental strawperson. In fairness to me though, Y&S never offered any example with the minimal conditions for SD to apply in their original paper while I did. This is part of what led to MacAskill’s counterpost.

This all said, I do think there is something that my definition offers (clarifies?) that Y&S’s does not. Consider your example. Suppose I have played 100 Newcombian games and one-boxed each time. Your Omega will then predict that on the 101st, I’ll one-box again. If I make decisions independently each time I play the game, then we have the example you presented and which I agree with. But I think it’s more interesting if I am allowed to change my strategy. From my perspective as an agent trying to confound Omega and win, I should not consider Omega’s predictions and my actions to subjunctively depend, and my definition would say so. Under the definition from Y&S, I think it’s less clear in this situation what I should think. Should I say that we’re SD “so far”? Probably not. Should I wait until I finish all interaction with Omega and then decide whether or not we were SD in retrospect? Seems silly. So I think my definition may lead to a more practical understanding than Y&S’s.
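The 101-game scenario can be made concrete with a toy simulation. This is only a sketch under stated assumptions: the majority-vote rule in `omega_predicts`, the standard Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one), and all function names are hypothetical choices for illustration, not anything specified in the discussion.

```python
def omega_predicts(history):
    """Hypothetical Omega: predicts from the agent's past choices only."""
    if not history:
        return "one-box"  # assumed default before any games are observed
    one_box_count = sum(1 for choice in history if choice == "one-box")
    # Predict one-boxing unless most past choices were two-boxing.
    return "one-box" if one_box_count * 2 >= len(history) else "two-box"


def payoff(prediction, choice):
    """Assumed standard Newcomb payoffs."""
    opaque = 1_000_000 if prediction == "one-box" else 0
    transparent = 1_000
    return opaque if choice == "one-box" else opaque + transparent


def play(choices):
    """Play the given sequence of choices against the history-based Omega."""
    history, total = [], 0
    for choice in choices:
        prediction = omega_predicts(history)  # made before the choice is seen
        total += payoff(prediction, choice)
        history.append(choice)
    return total, history


steady_total, _ = play(["one-box"] * 101)
switch_total, _ = play(["one-box"] * 100 + ["two-box"])
print(steady_total, switch_total)
```

Against this particular Omega, the agent that switches on the 101st game collects the extra $1,000 in that round, because the prediction tracks the agent's past record and not the decision it is about to make.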

Do you think we’re about on the same page? Thanks again for the post.

Don’t know what part of the post you’re referring to.