Johannes Treutlein 28 Mar 2023 18:29 UTC
LW: 11 AF: 8
0
AF
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.

Anthropic uncertainty in the Evidential Blackmail problem

Johannes Treutlein 14 May 2017 16:43 UTC

10 points

1 comment · 1 min read · LW link

(casparoesterheld.com)

Johannes Treutlein 29 Jun 2023 0:02 UTC
LW: 10 AF: 7
AF
on: Acausal trade: being unusual
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: conclusion: theory vs practice
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: different utilities, different trades
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: trade barriers
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: full decision algorithms
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:00 UTC
LW: 10 AF: 7
AF
on: Acausal trade: double decrease
Fixed links to all the posts in the sequence:

Johannes Treutlein 9 Feb 2023 2:19 UTC
LW: 10 AF: 5
3
AF
on: Trying to Make a Treacherous Mesa-Optimizer
I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. That is, what exactly is the training setup, what exactly is the model, and which parts are hard-coded and which are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it takes a lot of effort to piece this together by going through the code). Right now I have a hard time making any inferences from the results.

“Betting on the Past” – a decision problem by Arif Ahmed

Johannes Treutlein 7 Feb 2017 21:14 UTC

7 points

6 comments · 1 min read · LW link

(casparoesterheld.com)

Johannes Treutlein 11 Mar 2024 18:42 UTC
6 points
0
in reply to: Erik Jenner’s comment on: ejenner’s Shortform
How much time do you think there is between “ability to automate” and “actually this has been automated”? Are your numbers for actual automation, or just ability? I personally would agree with your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people’s inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated).

Johannes Treutlein 21 Dec 2022 17:47 UTC
LW: 6 AF: 4
2
AF
in reply to: Vaniver’s comment on: Proper scoring rules don’t guarantee predicting fixed points
Thanks for your comment!

Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when $f$ is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function $f$ such that a dishonest prediction is optimal.
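To make the last claim concrete, here is a minimal sketch in Python. It assumes a binary outcome, the Brier (quadratic) score as the strictly proper scoring rule, and a made-up linear function `f(p) = 0.3 + 0.6 * p` that maps the prediction to the actual probability of the outcome. Neither this function nor the names below come from the paper; they are only for illustration. The unique fixed point is `p = 0.75`, yet the expected score is highest at `p = 1`, where the actual probability is only `0.9`.

```python
def brier_score(p, outcome):
    """Brier (quadratic) score for prediction p and a binary outcome; higher is better."""
    return -(outcome - p) ** 2


def f(p):
    """Hypothetical influence of the prediction on the world (illustrative assumption)."""
    return 0.3 + 0.6 * p


def expected_score(p):
    """Expected Brier score when the outcome occurs with probability f(p)."""
    q = f(p)
    return q * brier_score(p, 1) + (1 - q) * brier_score(p, 0)


# Search a grid of predictions in [0, 1] for the score-maximizing report.
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=expected_score)

fixed_point = 0.75  # solves f(p) = p for this particular f

print("best prediction:", best_p, "actual probability there:", f(best_p))
print("score at best prediction:", expected_score(best_p))
print("score at fixed point:", expected_score(fixed_point))
```

In this toy case the optimal report overstates the actual probability, so it is not an honest prediction even though the scoring rule is strictly proper.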

Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to honest ones (at least in L1-distance).

I think for ‘self-fulfilling prophecy’ I would also expect there to be a counterfactual element—if I say the sun will rise tomorrow and it rises tomorrow, this isn’t a self-fulfilling prophecy because the outcome isn’t reliant on expectations about the outcome.

Yes, that is fair. To be faithful to the common usage of the term, one should maybe require at least two possible fixed points (or points that are somehow close to fixed points). The case with a unique fixed point is probably also safer, and worries about “self-fulfilling prophecies” don’t apply to the same degree.

Johannes Treutlein 10 Jul 2017 17:48 UTC
LW: 6 AF: 5
AF
on: Smoking Lesion Steelman
From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:

| Agent type | Action | Killed | Not killed |
|---|---|---|---|
| Smoke-lover | Smokes | 10 | −90 |
| Smoke-lover | Doesn’t smoke | 0 | 0 |
| Non-smoke-lover | Smokes | −100 | −100 |
| Non-smoke-lover | Doesn’t smoke | 0 | 0 |

For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.

Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.

Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let $a_{1}$ be smoking, $a_{2}$ not smoking, $S_{1}$ being a smoke-lover, and $S_{2}$ being a non-smoke-lover. Moreover, say the prior probability $P(S_{1})$ is $0.5$. Then, for a smoke-loving CDT bot, the expected utility of smoking is just

$E[U | a_{1}] = P(S_{1}) \cdot U(S_{1} \land a_{1}) + P(S_{2}) \cdot U(S_{2} \land a_{1}) = 0.5 \cdot 10 + 0.5 \cdot (-90) = -40$,

which is less than the certain $0$ utilons for $a_{2}$. Assigning a credence of around $1$ to $P(S_{1} | a_{1})$, a smoke-loving EDT bot calculates

$E[U | a_{1}] = P(S_{1} | a_{1}) \cdot U(S_{1} \land a_{1}) + P(S_{2} | a_{1}) \cdot U(S_{2} \land a_{1}) \approx 1 \cdot 10 + 0 \cdot (-90) = 10$,

which is higher than the expected utility of $a_{2}$.
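The two calculations can be reproduced with a short Python sketch. It uses the smoke-lover rows of the payoff matrix above and assumes, as the calculations implicitly do, that smoke-lovers are always killed and non-smoke-lovers never are. The function and variable names are hypothetical, and the credence of exactly $1$ for the EDT bot stands in for "around $1$".

```python
# Smoke-lover's utilities from the payoff matrix, indexed by (action, killed).
SMOKE_LOVER_UTILITY = {
    ("smoke", True): 10,
    ("smoke", False): -90,
    ("abstain", True): 0,
    ("abstain", False): 0,
}


def expected_utility(action, p_smoke_lover):
    """Expected utility for a smoke-loving bot.

    Assumption: being a smoke-lover (S1) means being killed, and being a
    non-smoke-lover (S2) means not being killed. The utility function is the
    smoke-lover's in both terms, as in the calculations in the text.
    """
    return (
        p_smoke_lover * SMOKE_LOVER_UTILITY[(action, True)]
        + (1 - p_smoke_lover) * SMOKE_LOVER_UTILITY[(action, False)]
    )


PRIOR_S1 = 0.5          # prior P(S1), used by CDT regardless of the action
POSTERIOR_S1_SMOKE = 1  # P(S1 | a1), idealized from "around 1" for EDT

cdt_smoke = expected_utility("smoke", PRIOR_S1)
edt_smoke = expected_utility("smoke", POSTERIOR_S1_SMOKE)
abstain = expected_utility("abstain", PRIOR_S1)  # 0 for any credence

print("CDT, smoke:", cdt_smoke)
print("EDT, smoke:", edt_smoke)
print("abstain:", abstain)
```

The only difference between the two bots in this sketch is the probability they plug in for $S_{1}$: the prior for CDT, the action-conditional credence for EDT.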

The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se.

In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.

Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows:

$E [U | a] = P (S_{1} | a) \cdot E [U | S_{1}, a] + P (S_{2} | a) \cdot E [U | S_{2}, a]$ ,

where, if $U_{1}$ ( $U_{2}$ ) is the utility function of a smoke-lover (non-smoke-lover), $E [U | S_{i}, a]$ is equal to $E [U_{i} | a]$ . In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.
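As a rough illustration of this consistent calculation, the Python sketch below weights each type's own utility function by the probability of being that type. It assumes, as before, that smoke-lovers are killed and non-smoke-lovers are not, so $E[U_{1} | a_{1}] = 10$ and $E[U_{2} | a_{1}] = -100$ from the payoff matrix. Setting $P(S_{i} | a)$ equal to the prior of $0.5$ is an illustrative assumption meaning that the action carries no information about the type; the names are hypothetical.

```python
# E[U_i | a] for each type, read off the payoff matrix under the assumption
# that smoke-lovers are killed and non-smoke-lovers are not.
EU_BY_TYPE = {
    "smoke-lover": {"smoke": 10, "abstain": 0},
    "non-smoke-lover": {"smoke": -100, "abstain": 0},
}


def consistent_expected_utility(action, p_type_given_action):
    """E[U | a] = sum_i P(S_i | a) * E[U_i | a], each type with its own utility."""
    return sum(
        p * EU_BY_TYPE[agent_type][action]
        for agent_type, p in p_type_given_action.items()
    )


# Illustrative assumption: the action tells the agent nothing about its type.
prior = {"smoke-lover": 0.5, "non-smoke-lover": 0.5}

eu_smoke = consistent_expected_utility("smoke", prior)
eu_abstain = consistent_expected_utility("abstain", prior)

print("smoke:", eu_smoke)
print("abstain:", eu_abstain)
```

Here the second term uses the non-smoke-lover's utility of $-100$ rather than the smoke-lover's $-90$, which is the difference from the mixed calculation above that yields $-40$.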

I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.
What links here?
- Smoking Lesion Steelman II by abramdemski (2 Oct 2017 22:11 UTC; 2 points)