Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
1. A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
2. A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own.)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).