ACCount comments on METR’s Observations of Reward Hacking in Recent Frontier Models

ACCount 9 Jun 2025 22:14 UTC
8 points
0
We’ve also noticed similar behaviors for other recent models, such as Claude 3.7 Sonnet and o1. In this post, we focus on o3 because we have many examples to draw from for it, but we believe that this kind of reward hacking is a general phenomenon, not isolated to any one model or developer.
There’s no reason to believe that reward hacking is unique to o3, no. But I do wonder if o3 reward hacks at a greater rate than most other models.
We’ve already seen numerous reports of o3 displaying an increased rate of hallucinations, and low truthfulness in general. I wonder if OpenAI has cooked o3 with a very “leaky” RL regiment that accidentally encouraged it to lie and cheat and hallucinate in pursuit of better performance metrics.
Many verifiable tasks completely fail to discourage this kind of behavior by default. A test like SAT doesn’t encourage saying “I don’t know” when you don’t know—giving any answer at all has a chance of being correct, saying “I don’t know” doesn’t. Thus, RL on SAT-like tests would encourage hallucinations.
- Random Developer 10 Jun 2025 10:42 UTC
  8 points
  5
  Parent
  But I do wonder if o3 reward hacks at a greater rate than most other models.
  When run using Claude Code, Claude 3.7 is constantly proposing ways to “cheat.” It will offer to delete unit tests, it will propose turning off the type checker, and it will try to change the type of a variable to any. It will hardcode code to return the answers test suites want, or even conditionally hardcode certain test output using except handlers when code raises an exception. Since Claude Code requests confirmation for these changes, it’s not technically “reward hacking”, but I’m pretty sure it’s at least correlated to reward-hacking behavior in slightly different contexts. The model has picked up terrible coding habits somewhere, the sort of thing you expect from struggling students in their first CS course.
  Apparently, Claude 4.0 reduces this kind of “cheating” by something around 80% on Anthropic’s benchmarks. Which was probably commercially necessary, because it was severely impairing the usefulness of 3.7.
  I suspect we’ll see this behavior drop somewhat in the near term, if only to improve the commercial usefulness of models. But I suspect the capacity will remain, and when models do cheat, they’ll do so more subtly.
  To me, this all seems like giant warning klaxons. I mean, I don’t think current architectures will actually get us to AGI without substantial tweaks. But if we do succeed in building AGI or ASI, I expect it to cheat and lie skillfully and regularly, with a significant degree of deniability. And I know I’m preaching to the choir here, but I don’t want to share a universe with something that smart and that fucked up.