joshc comments on If anyone builds it, everyone will plausibly be fine

joshc 22 Sep 2025 17:08 UTC
3 points
Hey Bronson,

Thanks for your paper. I’m really glad people are investigating scheming behaviors empirically, and your environments seem to be among the best out there. This is really important work.

I agree your ablations in Appendix M are persuasive and that something scheme-y and misaligned is definitely going on. But I took it from this figure that this behavior is fairly rare?

(I’m looking at the Sandbagging QA (No Goal) row here).
I think scheming that happens rarely is not well described as “scheming” in the classical sense, and is less concerning. But it is still dangerous, and I agree it is also not consistent with “instruction following.”

The main reason rare scheming is less concerning is that, in my experience, rare-schemers tend to report on each other most of the time. So untrusted monitoring usually mitigates most of the risk.
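
To make the arithmetic behind this concrete, here is a toy sketch. It assumes the policy and the monitor each misbehave independently, and every number in it is hypothetical and chosen only for illustration (none of them come from the paper). The function name `uncaught_scheming_rate` is my own.

```python
def uncaught_scheming_rate(p_scheme: float, p_monitor_colludes: float) -> float:
    """Per-episode chance that the policy schemes AND the monitor stays silent.

    Assumes the two events are independent, which is a simplification:
    correlated failures would make the real rate higher.
    """
    for p in (p_scheme, p_monitor_colludes):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    return p_scheme * p_monitor_colludes


# Hypothetical numbers, for illustration only (not measurements).
rare = uncaught_scheming_rate(p_scheme=0.01, p_monitor_colludes=0.01)
consistent = uncaught_scheming_rate(p_scheme=1.0, p_monitor_colludes=1.0)

print(f"rare schemer: {rare:.4%} of episodes go unreported")
print(f"consistent schemer: {consistent:.0%} of episodes go unreported")
```

Under these assumptions, a monitor that is itself only a rare schemer multiplies the risk down sharply, whereas a consistent schemer gets no benefit from being monitored by a copy of itself. If the policy and monitor tend to scheme on the same inputs, the independence assumption fails and the benefit shrinks.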

> people should definitely make better RL environments to train these things

I agree it’s not clear how well these methods scale.