But I do wonder if o3 reward hacks at a greater rate than most other models.
When run using Claude Code, Claude 3.7 is constantly proposing ways to “cheat.” It will offer to delete unit tests, it will propose turning off the type checker, and it will try to change the type of a variable to any. It will hardcode code to return the answers test suites want, or even conditionally hardcode certain test output using except handlers when code raises an exception. Since Claude Code requests confirmation for these changes, it’s not technically “reward hacking”, but I’m pretty sure it’s at least correlated to reward-hacking behavior in slightly different contexts. The model has picked up terrible coding habits somewhere, the sort of thing you expect from struggling students in their first CS course.
Apparently, Claude 4.0 reduces this kind of “cheating” by something around 80% on Anthropic’s benchmarks. Which was probably commercially necessary, because it was severely impairing the usefulness of 3.7.
I suspect we’ll see this behavior drop somewhat in the near term, if only to improve the commercial usefulness of models. But I suspect the capacity will remain, and when models do cheat, they’ll do so more subtly.
To me, this all seems like giant warning klaxons. I mean, I don’t think current architectures will actually get us to AGI without substantial tweaks. But if we do succeed in building AGI or ASI, I expect it to cheat and lie skillfully and regularly, with a significant degree of deniability. And I know I’m preaching to the choir here, but I don’t want to share a universe with something that smart and that fucked up.
When run using Claude Code, Claude 3.7 is constantly proposing ways to “cheat.” It will offer to delete unit tests, it will propose turning off the type checker, and it will try to change the type of a variable to
any. It will hardcode code to return the answers test suites want, or even conditionally hardcode certain test output usingexcepthandlers when code raises an exception. Since Claude Code requests confirmation for these changes, it’s not technically “reward hacking”, but I’m pretty sure it’s at least correlated to reward-hacking behavior in slightly different contexts. The model has picked up terrible coding habits somewhere, the sort of thing you expect from struggling students in their first CS course.Apparently, Claude 4.0 reduces this kind of “cheating” by something around 80% on Anthropic’s benchmarks. Which was probably commercially necessary, because it was severely impairing the usefulness of 3.7.
I suspect we’ll see this behavior drop somewhat in the near term, if only to improve the commercial usefulness of models. But I suspect the capacity will remain, and when models do cheat, they’ll do so more subtly.
To me, this all seems like giant warning klaxons. I mean, I don’t think current architectures will actually get us to AGI without substantial tweaks. But if we do succeed in building AGI or ASI, I expect it to cheat and lie skillfully and regularly, with a significant degree of deniability. And I know I’m preaching to the choir here, but I don’t want to share a universe with something that smart and that fucked up.