Buck comments on Alex Mallen’s Shortform

Buck 25 Mar 2026 19:58 UTC
One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where reward is most nearly “the optimization target”), so it also inherits CDT behavior from the RL algorithm.
The way I’d say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.