Just an incomplete comment on “The assumptions that make reward-seekers plausible also make fitness-seekers plausible”: I think a central question is whether X-seeking gives you a more compressed policy than the “optimal kludge”. That is: if learning the optimal policy as an X-seeker is just as hard as learning the optimal kludge as a non-X-seeker, then it seems like I’m unlikely to learn X-seeking (or X-seeking is at best no more likely than any of a whole host of other possible behavioural spandrels, which implies exactly the same thing).
I think the argument that X-seeking incentivises optimal behaviour is some reason to think it might be compressive, but not obviously a very strong one: if all X-seeking gets you is “I should do well on evals”, then that’s a very small piece of policy it’s compressing, not obviously even enough to pay its own cost. That is, the extra bit of policy “I should seek X” seems like it could easily be longer, or have lower prior probability, than “I should do well on evals”. If “I should seek X” helped further in actually doing well on evals, then I think there’s a stronger argument to be made, but... I just need to think about this more; it’s not immediately apparent what that would actually look like.
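One rough way to write the comparison down, as a sketch only: the notation here is an assumption of mine for illustration, not something taken from the post. Let $L(\cdot)$ stand for description length (or negative log prior probability), let $s_X$ be the policy fragment “I should seek X”, let $e$ be the fragment “I should do well on evals”, and let $r$ be everything else needed to actually behave optimally.

```latex
% Assumed notation (illustrative only): L = description length or -log prior,
% s_X = "I should seek X", e = "I should do well on evals",
% r = the rest of the policy needed to actually do well.

% X-seeking is favoured over the optimal kludge only if
\[
  L(s_X) + L(\pi^{*} \mid s_X) \;<\; L(\pi_{\text{kludge}}).
\]

% If all X-seeking buys is "I should do well on evals", then roughly
\[
  L(\pi^{*} \mid s_X) \approx L(r), \qquad
  L(\pi_{\text{kludge}}) \approx L(e) + L(r),
\]

% and the condition collapses to
\[
  L(s_X) \;<\; L(e).
\]
```

On this reading, the worry is just that $L(s_X) < L(e)$ is not obviously true. The stronger argument would need $L(\pi^{*} \mid s_X)$ to be meaningfully smaller than $L(r)$, i.e. X-seeking would have to compress part of the “actually doing well” policy and not only the goal of doing well.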