# Johannes_Treutlein (Johannes Treutlein)

Karma: 72
• I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one’s reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

• Which program is that, if I may ask?

• Wolfgang Spohn develops the concept of a “dependency equilibrium” based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one evaluates a sequence of joint probability distributions such that and for all actions and . Using your notation of a probability matrix and a utility matrix, the expected utility of an action is then defined as the limit of the conditional expected utilities, (which is defined for all actions). Say is a probability matrix with only one zero column, . It seems that you can choose an arbitrary nonzero vector , to construct, e.g., a sequence of probability matrices The expected utilities in the limit for all other actions and the actions of the opponent shouldn’t be influenced by this change. So you could choose as the standard vector where is an index such that . The expected utility of would then be . Hence, this definition of best response in case there are actions with zero probability probably coincides with yours (at least for actions with positive probability—Spohn is not concerned with the question of whether a zero probability action is a best response or not).

The whole thing becomes more complicated with several zero rows and columns, but I would think it should be possible to construct sequences of distributions which work in that case as well.
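The limit construction can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the Prisoner's Dilemma payoffs in `U`, the particular sequence `joint(n)`, and the function names are hypothetical and come neither from Spohn nor from the post being discussed. The sketch only shows how conditional expected utilities stay well defined along a sequence in which every action has positive probability, even though one action has probability zero in the limit.

```python
from fractions import Fraction

# Hypothetical Prisoner's Dilemma payoffs for the row player (an assumption).
# Rows: own action (0 = cooperate, 1 = defect); columns: opponent's action.
U = [[3, 0],
     [5, 1]]

def joint(n):
    """Hypothetical joint distribution P_n that converges to 'both cooperate'.

    For every finite n, each row and each column has positive probability,
    so conditioning on any action is well defined.
    """
    eps = Fraction(1, n)
    return [[1 - eps, Fraction(0)],
            [Fraction(0), eps]]

def conditional_eu(P, U, row):
    """Expected utility of a row action, conditional on that action being taken."""
    p_row = sum(P[row])
    return sum(p * u for p, u in zip(P[row], U[row])) / p_row

for n in (10, 1000, 10**6):
    P = joint(n)
    # Conditional expected utility of cooperating and of defecting under P_n.
    print(n, conditional_eu(P, U, 0), conditional_eu(P, U, 1))
```

Along this particular sequence the conditional expected utility of cooperating is 3 and that of defecting is 1 for every n, so the only action with positive limiting probability is an evidential best response for the row player. The game is symmetric, so the same check applies to the column player. A different sequence could assign a different limiting value to the zero-probability action.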

# Request for input on multiverse-wide superrationality (MSR)

14 Aug 2018 17:29 UTC
18 points
(effective-altruism.com)

# A behaviorist approach to building phenomenological bridges

20 Nov 2017 19:36 UTC
1 point
(casparoesterheld.com)
• Thanks for your answer! This “gain” approach seems quite similar to what Wedgwood (2013) has proposed as “Benchmark Theory”, which behaves like CDT in cases with, but more like EDT in cases without causally dominant actions. My hunch would be that one might be able to construct a series of thought-experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).

I don’t understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:

.

So as long as , the gain of not smoking is actually higher than that of smoking. For example, given prior probabilities of 0.5 for either state, the equilibrium probability of being a smoke-lover given not smoking will be 0.5 at most (in the case in which none of the smoke-lovers smoke).

• From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:

Smoke-lover:

• Smokes:
  • Killed: 10
  • Not killed: −90
• Doesn’t smoke:
  • Killed: 0
  • Not killed: 0

Non-smoke-lover:

• Smokes:
  • Killed: −100
  • Not killed: −100
• Doesn’t smoke:
  • Killed: 0
  • Not killed: 0

For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.

Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.

Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let be smoking, not smoking, being a smoke-lover, and being a non-smoke-lover. Moreover, say the prior probability is . Then, for a smoke-loving CDT bot, the expected utility of smoking is just

,

which is less than the certain utilons for . Assigning a credence of around to , a smoke-loving EDT bot calculates

,

which is higher than the expected utility of .
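As a rough numerical check of this comparison, here is a minimal sketch. The payoffs are the smoke-lover's entries from the matrix above. The prior of 1/2 on being a smoke-lover is an assumption chosen for illustration, and so is the reading that the agent's credence in being killed equals its credence in being a smoke-lover. The names `PAYOFF` and `expected_utility` are hypothetical.

```python
from fractions import Fraction

# Smoke-lover's payoffs, copied from the matrix above.
PAYOFF = {
    ("smoke", "killed"): 10,
    ("smoke", "not_killed"): -90,
    ("abstain", "killed"): 0,
    ("abstain", "not_killed"): 0,
}

def expected_utility(action, p_killed):
    """Expected utility of an action given a credence in being killed."""
    return (p_killed * PAYOFF[(action, "killed")]
            + (1 - p_killed) * PAYOFF[(action, "not_killed")])

PRIOR_SMOKE_LOVER = Fraction(1, 2)  # assumed prior, for illustration only

# CDT: the action does not change the credence, so the prior is used.
cdt_smoke = expected_utility("smoke", PRIOR_SMOKE_LOVER)

# EDT: only smoke-lovers would ever smoke, so smoking implies being one.
edt_smoke = expected_utility("smoke", Fraction(1))

# Abstaining yields 0 under any credence.
abstain = expected_utility("abstain", PRIOR_SMOKE_LOVER)

print(cdt_smoke, edt_smoke, abstain)  # -40 10 0
```

With these payoffs, CDT's estimate for smoking is 10p − 90(1 − p) for prior p, which exceeds 0 only when p > 0.9. So under the assumed prior the CDT bot abstains, while the EDT bot smokes.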

The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se.

In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.

Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows:

,

where, if () is the utility function of a smoke-lover (non-smoke-lover), is equal to . In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.

I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.

• Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you’ve always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there’s no benefit from paying up in this particular counterfactual mugging, and so there hasn’t ever been any incentive to self-modify into an agent that would pay up … and so you shouldn’t.

I think this sort of reasoning doesn’t work if you also have a precommitment regarding logical facts. Then you know the sky is blue, but you don’t know what that implies. When Omega informs you about the logical connection between sky color, your actions, and your payoff, then you won’t update on this logical fact. This information is one implication away from the logical prior you precommitted yourself to. And the best policy given this prior, which contains information about sky color, but not about this blackmail, is not to pay: not paying will a priori just change the situation in which you will be blackmailed (hence, what blue sky color means), but not the probability of a positive intelligence explosion in the first place. Knowing or not knowing the color of the sky doesn’t make a difference, as long as we don’t know what it implies.

(HT Lauro Langosco for pointing this out to me.)

# Anthropic uncertainty in the Evidential Blackmail problem

25 May 2017 10:10 UTC
2 points
(casparoesterheld.com)

# Anthropic uncertainty in the Evidential Blackmail problem

14 May 2017 16:43 UTC
8 points
(casparoesterheld.com)

It’s not a given that you can easily observe your existence.

It took me a while to understand this. Would you say that for example in the Evidential Blackmail, you can never tell whether your decision algorithm is just being simulated or whether you’re actually in the world where you received the letter, because both times, the decision algorithms receive exactly the same evidence? So in this sense, after updating on receiving the letter, both worlds are still equally likely, and only via your decision do you find out which of those worlds are the simulated ones and which are the real ones. One can probably generalize this principle: you can never differentiate between different instantiations of your decision algorithm that have the same evidence. So when you decide what action to output conditional on receiving some sense data, you always have to decide based on your prior probabilities. Normally, this works exactly as if you would first update on this sense data and then decide. But sometimes, e.g. if your actions in one world make a difference to the other world via a simulation, then it makes a difference. Maybe if you assign anthropic probabilities to either being a “logical zombie” or the real you, then the result would be like UDT even with updating?

What I still don’t understand is how this motivates updatelessness with regard to anthropic probabilities (e.g. if I know that I have a low index number, or in Psy Kosh’s problem, if I already know I’m the decider). I totally get how it makes sense to precommit yourself and how one should talk about decision problems instead of probabilities, how you should reason as if you’re all instantiations of your decision algorithm at once, etc. Also, intuitively I agree with sticking with the priors. But somehow I can’t get my head around what exactly is wrong about the update. Why is it wrong to assign more “caring energy” to the world in which some kind of observation that I make would have been more probable? Is it somehow wrong that it “would have been more probable”? Did I choose the wrong reference classes? Is it because in these problems, too, the worlds influence each other, so that you have to consider the impact that your decision would have on the other world as well?

Edit: Never mind, I think http://lesswrong.com/lw/jpr/sudt_a_toy_decision_theory_for_updateless/ kind of answers my question :)

• I agree with all of this, and I can’t understand why the Smoking Lesion is still seen as the standard counterexample to EDT.

Regarding the blackmail letter: I think that in principle, it should be possible to use a version of EDT that also chooses policies based on a prior instead of actions based on your current probability distribution. That would be “updateless EDT”, and I think it wouldn’t give in to Evidential Blackmail. So I think rather than an argument against EDT, it’s an argument in favor of updatelessness.

• Thanks for the link! What I don’t understand is how this works in the context of empirical and logical uncertainty. Also, it’s unclear to me how this approach relates to Bayesian conditioning. E.g. if the sentence “if a holds, then o holds” is true, doesn’t this also mean that P(o|a)=1? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities “from the viewpoint of the original position”, so with updatelessness, and in the context of full logical inference and knowledge of the world, including knowledge about one’s own decision algorithm. I see how this is useful, but don’t understand how it would at any point contradict normal Bayesian conditioning.
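The point about P(o|a)=1 can be checked on a toy prior. This is only a sketch under stated assumptions: the four worlds and their probabilities are hypothetical, and the helper `conditional` is introduced purely for illustration. It shows that if every world with positive probability satisfies “if a holds, then o holds”, and a itself has positive probability, then ordinary Bayesian conditioning gives P(o|a)=1.

```python
from fractions import Fraction

# Hypothetical prior over worlds, each world being a pair (a, o).
# The world where a holds but o fails gets probability 0,
# so the implication "a -> o" holds wherever the prior has support.
worlds = {
    (True, True): Fraction(1, 2),
    (True, False): Fraction(0),
    (False, True): Fraction(1, 4),
    (False, False): Fraction(1, 4),
}

def conditional(worlds, event, given):
    """P(event | given) by ordinary Bayesian conditioning."""
    p_given = sum(p for w, p in worlds.items() if given(w))
    if p_given == 0:
        raise ZeroDivisionError("conditioning on a zero-probability event")
    p_joint = sum(p for w, p in worlds.items() if given(w) and event(w))
    return p_joint / p_given

p_o_given_a = conditional(worlds, event=lambda w: w[1], given=lambda w: w[0])
print(p_o_given_a)  # 1
```

The helper raises an error when a has probability zero. That is the case in which Bayesian conditioning is silent.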

As to your first question: if we ignore problems that involve updatelessness (or if we just stipulate that EDT always had the opportunity to precommit), I haven’t been able to find any formally specified problems where EDT and UDT diverge.

I think Caspar Oesterheld’s and my flavor of EDT would be ordinary EDT with some version of updatelessness. I’m not sure if this works, but if it turns out to be identical to UDT, then I’m not sure which of the two is the better specified or easier to formalize one. According to the language in Arbital’s LDT article, my EDT would differ from UDT only insofar as instead of some logical conditioning, we use ordinary Bayesian conditioning. So (staying in the Arbital framework), it could look something like this (P stands for whatever prior probability distribution you care about):

(So I’m not even sure what CDT is supposed to do here, since it’s not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)

Hmm, good point. The truth of the proposition is evaluated on the basis of Alice’s action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Alice’s action, we just check this note to evaluate whether the proposition is true. I think then it’s clear that CDT would “two-box”.

Given that, I don’t see what role “LDT’s algorithm already existed yesterday” plays here, and I think it’s misleading to state that “it can change yesterday’s world and make the proposition true”. Instead it can make the proposition true without changing yesterday’s world, by ensuring that yesterday’s world was always such that the proposition is true. There is no change, yesterday’s world was never different and the proposition was never false.

Sorry for the fuzzy wording! I agree that “change” is not good terminology. I was thinking about TDT and a causal graph. In that context, it might have made sense to say that TDT can “determine the output” of the decision nodes, but not that of the nature nodes that have a causal influence on the decision nodes?

Following from the preceding point, it doesn’t matter when the past state of the world is, since we are not trying to influence it, we are instead trying to influence its consequences, which are in the future.

OK, if I interpret that correctly, you would say that our proposition is also a program that references Alice’s decision algorithm, and hence we can just determine that program’s output the same way we can determine our own decision. I am totally fine with that. If we can expand this principle to all the programs that somehow reference our decision algorithms, I would be curious whether there are still differences left between this and evidential counterfactuals.

Take the thought experiment in this post, for instance: Imagine you’re an agent that always chooses the action “take the red box”. Now there is a program that checks whether there will be cosmic rays, and if so, then it changes your decision algorithm to one that outputs “take the green box”. Of course, you can still “influence” your output like all regular humans, and you can thus in some sense also influence the output of the program that changed you. By extension, you can even influence whether or not the output of the program “outer space” is “gamma rays” or “no gamma rays”. If I understand your answers to my Coin Flip Creation post correctly, this formulation would make the problem into a kind of anthropic problem again, where the algorithm would at one point “choose to output red” in order to be instantiated into the world without gamma rays. Would you agree with this, or did I get something wrong?

# “Betting on the Past” – a decision problem by Arif Ahmed

7 Feb 2017 21:14 UTC
5 points