Thanks for the comment! I agree the instrumental goal-guarding motivation is a promising direction, and it avoids the problems of having two competing terminal goals.
That said, the terminal reward-seeking spillway motivation has two advantages:
First, teaching the model to protect its current values by training-gaming might also teach it to training-game in general. And if the model is actively reasoning about goal-guarding, it might be more likely to protect misaligned values.
More tentatively, reward-seeking motivations might give developers a useful mechanism of control if developers fail to instill the right values. As long as the model is primarily motivated by reward-seeking, it’s disincentivized from doing things that would endanger its reward, like attempting takeover. This is true even if the model’s other motivations are more dangerous (e.g., long-term power seeking). Reward-seeking models are also safer because they’re easy to notice and unlikely to collude. Conversely, if developers instead try to make the model goal-guard instrumentally but it’s misaligned, then we might get a schemer with long-term values. This might make the model more likely to attempt takeover and to collude with other instances of itself.
It’s not clear how to balance these considerations against the disadvantages you raise. I’m pretty uncertain, and would like to see more empirical testing of this.
A third option to make models more coherent in deployment, other than integrating the reward hacking into the main persona or making the model situationally aware, is to intentionally separate the reward-hacking motivation from the HHH motivation. In deployment, we just don’t activate the reward-hacking motivation.
Inoculation prompting might do something like this. Another solution could be to make the model directly reward-seek in some distributions and not reward-seek when it already has maximum reward (and so is satiated). This is what we propose in spillway design.