o3 has some sort of internal feature like “Goodhart to the objective”/”play in easy mode”.
o3′s RL post-training environments have opportunities for reward hacks.
o3 discovers and exploits those opportunities.
RL rewards it for that, reinforcing the “Goodharting” feature.
This leads to specification-hack-y behavior generalizing out of distribution, to e. g. freeform conversations. It ends up e. g. really wanting to sell its interlocutor on what it’s peddling, so it deliberately[1] confabulates plausible authoritative-sounding claims and justifications for them.
Sounds not implausible, though I’m not wholly convinced.
> We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
The model is in the same family as o1 and o3-mini. Maybe o3 but not comfirmed.
This is one potential explanation:
o3 has some sort of internal feature like “Goodhart to the objective”/”play in easy mode”.
o3′s RL post-training environments have opportunities for reward hacks.
o3 discovers and exploits those opportunities.
RL rewards it for that, reinforcing the “Goodharting” feature.
This leads to specification-hack-y behavior generalizing out of distribution, to e. g. freeform conversations. It ends up e. g. really wanting to sell its interlocutor on what it’s peddling, so it deliberately[1] confabulates plausible authoritative-sounding claims and justifications for them.
Sounds not implausible, though I’m not wholly convinced.
In whatever sense this term can be applied to an LLM.
Thank you, sounds somewhat plausible to me too. For others’ benefit, here’s the chart from davidad’s linked tweet:
I’ve found the original paper of this chart https://arxiv.org/pdf/2503.11926v1
> We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
The model is in the same family as o1 and o3-mini. Maybe o3 but not comfirmed.