Johannes Treutlein
Proper scoring rules don’t guarantee predicting fixed points
Report on modeling evidential cooperation in large worlds
Stop-gradients lead to fixed point predictions
Training goals for large language models
Did EDT get it right all along? Introducing yet another medical Newcomb problem
Request for input on multiverse-wide superrationality (MSR)
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
Anthropic uncertainty in the Evidential Blackmail problem
Fixed links to all the posts in the sequence:
I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. That is, what exactly is the training setup, what exactly is the model, which parts are hard-coded, and which parts are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it is very effortful to try to piece this together by going through the code). Right now I have a hard time making any inferences from the results.
“Betting on the Past” – a decision problem by Arif Ahmed
How much time do you think there is between “ability to automate” and “actually this has been automated”? Are your numbers for actual automation, or just ability? I personally would agree with your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people’s inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated).
Thanks for your comment!
Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when the function mapping predictions to outcome distributions is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is such a function for which a dishonest prediction is optimal.
Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to honest ones (at least in L1-distance).
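A minimal numerical sketch of this point, under assumptions that are not taken from the comment: a binary outcome, the quadratic (Brier) score as the strictly proper rule, and a hypothetical linear function `f(p) = 0.3 + 0.25 * p` giving the probability of the event when `p` is predicted. An honest prediction here would be the fixed point `p = f(p) = 0.4`, yet the score-maximizing prediction lies elsewhere.

```python
# Sketch: when the prediction influences the outcome, a strictly proper
# scoring rule need not be maximized at a fixed point.
# All names and numbers here are illustrative assumptions.

def f(p: float) -> float:
    """Hypothetical: probability of the event given that p is predicted."""
    return 0.3 + 0.25 * p

def expected_brier(p: float, q: float) -> float:
    """Expected quadratic score of predicting p when the true
    probability of the event is q (higher is better)."""
    return -(q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2)

def optimal_prediction(steps: int = 1000) -> float:
    """Grid search for the prediction maximizing S(p, f(p))."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: expected_brier(p, f(p)))

fixed_point = 0.3 / (1.0 - 0.25)  # solves p = f(p)
best = optimal_prediction()
print(fixed_point, best)
```

For this particular `f`, the derivative of the expected score with respect to `p` works out to `0.35 - p`, so the optimum sits at `p = 0.35` rather than at the fixed point `0.4`. If `f` were constant instead, the same search would recover the honest prediction, which is what strict propriety guarantees.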
I think for ‘self-fulfilling prophecy’ I would also expect there to be a counterfactual element—if I say the sun will rise tomorrow and it rises tomorrow, this isn’t a self-fulfilling prophecy because the outcome isn’t reliant on expectations about the outcome.
Yes, that is fair. To be faithful to the common usage of the term, one should maybe require at least two possible fixed points (or points that are somehow close to fixed points). The case with a unique fixed point is probably also safer, and worries about “self-fulfilling prophecies” don’t apply to the same degree.
From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT
I agree with this.
It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.
I’d also be interested in finding such a problem.
I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:
- Smoke-lover:
  - Smokes:
    - Killed: 10
    - Not killed: −90
  - Doesn’t smoke:
    - Killed: 0
    - Not killed: 0
- Non-smoke-lover:
  - Smokes:
    - Killed: −100
    - Not killed: −100
  - Doesn’t smoke:
    - Killed: 0
    - Not killed: 0
For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.
Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.
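Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let $S$ be smoking, $\neg S$ not smoking, $L$ being a smoke-lover, and $\neg L$ being a non-smoke-lover. Moreover, say the prior probability of being a smoke-lover is $P(L) = p$. Then, for a smoke-loving CDT bot, the expected utility of smoking is just

$$\mathbb{E}_{\mathrm{CDT}}[U(S)] = P(L) \cdot 10 + P(\neg L) \cdot (-90) = 100p - 90,$$

which is less than the certain $0$ utilons for $\neg S$ whenever $p < 0.9$. Assigning a credence of around $1$ to $P(L \mid S)$, a smoke-loving EDT bot calculates

$$\mathbb{E}_{\mathrm{EDT}}[U \mid S] = P(L \mid S) \cdot 10 + P(\neg L \mid S) \cdot (-90) \approx 10,$$

which is higher than the expected utility of $\neg S$.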
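The two calculations can be sketched numerically from the payoff matrix above. The prior of 0.5 and all function and variable names are assumptions for illustration. The payoffs are those of the smoke-lover, and being killed is taken to coincide with being a smoke-lover, as in the argument above.

```python
# Payoffs for an agent with a smoke-lover's utility function
# (taken from the payoff matrix).
U_SMOKE_KILLED = 10.0
U_SMOKE_NOT_KILLED = -90.0
U_NOT_SMOKE = 0.0  # not smoking yields 0 in either state

def eu_smoke(prob_smoke_lover: float) -> float:
    """Expected utility of smoking under a smoke-lover's payoffs, given a
    probability of being a smoke-lover (and hence of being killed)."""
    return (prob_smoke_lover * U_SMOKE_KILLED
            + (1.0 - prob_smoke_lover) * U_SMOKE_NOT_KILLED)

PRIOR = 0.5  # assumed prior probability of being a smoke-lover

# CDT evaluates smoking with the prior: the action does not cause the type.
cdt_eu = eu_smoke(PRIOR)

# EDT conditions on the action: only smoke-lovers would ever smoke,
# so the credence in being a smoke-lover given smoking is about 1.
edt_eu = eu_smoke(1.0)

print(cdt_eu, edt_eu)
```

With these payoffs, the CDT value `100 * prior - 90` is negative for any prior below 0.9, so the CDT bot declines to smoke, while the EDT value of 10 exceeds the 0 from not smoking.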
The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se.
In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.
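Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action $A$, both as smoke-lovers and as non-smoke-lovers, as follows:

$$\mathbb{E}[U \mid A] = P(L \mid A)\,\mathbb{E}[U \mid L, A] + P(\neg L \mid A)\,\mathbb{E}[U \mid \neg L, A],$$

where, if $U_L$ ($U_{\neg L}$) is the utility function of a smoke-lover (non-smoke-lover), $\mathbb{E}[U \mid L, A]$ is equal to $\mathbb{E}[U_L \mid L, A]$ (and likewise for $\neg L$). In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.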
I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.
Since the links above are broken, here are links to all the other posts in the sequence:
This post
Acausal trade: double decrease
Acausal trade: universal utility, or selling non-existence insurance too late
Acausal trade: full decision algorithms
Acausal trade: trade barriers
Acausal trade: different utilities, different trades
Acausal trade: being unusual
Acausal trade: conclusion: theory vs practice