These results suggest that it is possible to audit LLMs to identify any hidden goals developed in training. It remains an open question how much techniques such as AO will generalize to frontier models in real deployment settings, as it has only been validated on relatively small LLMs and with artificial training for harmful behaviors and hidden knowledge, rather than naturally occurring misalignment.
We have something pretty close to naturally occurring misaligned models with hidden goals: the ones from “Natural Emergent Misalignment From Reward Hacking In Production RL” https://arxiv.org/pdf/2511.18397v1
That’s from taking actual production RL training processes, cranking factors up in foolish ways until the model learns to reward hack, and then studying the effect that reward hacking has on the model’s behavior. The result isn’t the cartoonish evil that standard emergent misalignment training produces, but rather something unscrupulous, scheming, and focused on reward hacking without getting caught or stopped. As model organisms go, it’s a pretty realistic one that a company that implemented RLVR badly could actually produce unintentionally. It actually schemes and actually shows alignment-faking.
But, as I understand the AO paper, they do not actually validate on models like these; they only test on models that they have trained to be misaligned in pretty straightforward ways. I still think it’s good work — I’m just pointing out a gap that future work could address.
I agree, and was pointing in the direction of the nearest thing to that I’m currently aware of, for someone who wanted to try this now.