Thane Ruthenis comments on Mo Putera’s Shortform

Thane Ruthenis 7 May 2025 7:52 UTC
10 points
3
This is one potential explanation:
- o3 has some sort of internal feature like “Goodhart to the objective”/”play in easy mode”.
- o3′s RL post-training environments have opportunities for reward hacks.
- o3 discovers and exploits those opportunities.
- RL rewards it for that, reinforcing the “Goodharting” feature.
- This leads to specification-hack-y behavior generalizing out of distribution, to e. g. freeform conversations. It ends up e. g. really wanting to sell its interlocutor on what it’s peddling, so it deliberately^[1] confabulates plausible authoritative-sounding claims and justifications for them.
Sounds not implausible, though I’m not wholly convinced.
1. ^
  In whatever sense this term can be applied to an LLM.
- Mo Putera 7 May 2025 8:08 UTC
  4 points
  0
  Parent
  Thank you, sounds somewhat plausible to me too. For others’ benefit, here’s the chart from davidad’s linked tweet:
  - Weaverzhu 7 May 2025 12:22 UTC
    3 points
    0
    Parent
    I’ve found the original paper of this chart https://arxiv.org/pdf/2503.11926v1
    
    > We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
    
    The model is in the same family as o1 and o3-mini. Maybe o3 but not comfirmed.