Zach Stein-Perlman comments on Steve Byrnes’s Shortform

Zach Stein-Perlman 29 Apr 2025 17:46 UTC
LW: 20 AF: 12
9
AF
I agree people often aren’t careful about this.
Anthropic says
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of “reward hacking” during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.
What links here?
- Foom & Doom 2: Technical alignment is hard by Steven Byrnes (23 Jun 2025 17:19 UTC; 152 points)
- Steven Byrnes 29 Apr 2025 19:26 UTC
  LW: 6 AF: 3
  5
  AF Parent
  Thanks!
  I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
  But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
  - Cole Wyeth 29 Apr 2025 21:19 UTC
    4 points
    0
    Parent
    Yes, you’re technically right.