Thane Ruthenis comments on GPT-4o Sycophancy Post Mortem

Thane Ruthenis 5 May 2025 18:00 UTC
14 points
2
As in, if your training environment rewards cheating, the model will generalize that to cheating in general.
I find it rather amusing that this is the first potential significant instance of that phenomenon. “Reasoning techniques LLMs learn during RL in domains with easy verification will generalize to open-ended domains!” → “wait no not like that”.
Though, of course, that’s also kinda concerning. Remember Ominous but Exciting OpenAI Vagueposts™ about “magic is what happens when an unstoppable RL optimization algorithm powered by sufficient compute meets an unhackable RL environment”? Well, clearly those environments of theirs did not, in fact, turn out to be unhackable.
Now put the pieces together:
- OpenAI researchers boldly proclaim their RL environments to be unhackable.
- Their environments turn out not to be unhackable.
- RL on CoTs in fact teaches models to hack their training environments if they’re able to.
- OpenAI researchers don’t catch the hacking mid-training early enough to prevent this, or don’t want to.
  - (Same for Anthropic researchers, if Sonnet 3.7 is any evidence. But I guess DeepMind’s models don’t exhibit this?)
- The resultant propensity to Goodhart to the specification then, evidently, Generalizes Further Than Alignment.
That’s such a classic recipe for AGI Doom. Failing with even less dignity than expected, indeed.
(The potential saving grace here is, of course, that this potentially dooms the paradigm, because it seems that Propensity to Hack Generalizes Further Than Capabilities as well. Meaning RL-on-CoTs won’t, in fact, be able to make capabilities go exponential.)