Hey Bronson,
Thanks for your paper. I’m really glad people are investigating scheming behaviors empirically, and your environments seem to be among the best out there. This is really important work.
I agree your ablations in Appendix M are persuasive and that something scheme-y and misaligned is definitely going on. But I took it from this figure that this behavior is fairly rare?
(I’m looking at the Sandbagging QA (No Goal) row here).
I think scheming that happens rarely is not well described as “scheming” in the classical sense, and it is less concerning. But it is still dangerous, and I agree it is not consistent with “instruction following.”
The main reason rare scheming is less concerning is that, in my experience, rare-schemers tend to report on each other most of the time. So untrusted monitoring usually mitigates most of the risk.
> people should definitely make better RL environments to train these things
I agree it’s not clear how well these methods scale.