That really is surprising, especially given that the announcement includes the following:
Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining ‘memory files’ to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a ‘Navigation Guide’ while playing Pokémon.
I’d pretty much assumed they fine-tuned on the Pokémon task and it was going to be just about one-shotting the game. Weird.
Yeah I feel like they came up with something nice to say while eliding the “no further progress” issue.
Weirdly, while the announcement talks about creating and maintaining multiple “memory files”, the new public ClaudePlaysPokemon stream has Claude Opus 4 using just a single memory file, which it doesn’t even create. Apparently this is “much better” than the setup Claude 3.7 Sonnet used, which let it create and maintain as many files as it wanted (usually to its detriment).
One other interesting tidbit I’ll throw in from the stream:
claudestans: @ClaudePlaysPokemon you mentioned before that there was a personality change from e.g. 3.5 sonnet to 3.7 (more persistent, less giving up etc). Have you noticed anything about opus 4 in terms of personality?
ClaudePlaysPokemon: Opus is so much better at keeping track of things that it gets more distressed when it can’t figure things out! So I need to convince it more that nothing is wrong, which I find quite interesting!
ClaudePlaysPokemon: like it will be very aware that it has taken 100 steps to solve something and it finds that very frustrating
Huh, indeed interesting. IIRC, one of the suggested problems with the previous models was the lack of boredom: that they’re perfectly capable of doing something in a loop forever where a human would’ve gotten frustrated and done something random that might’ve ended up helping. Sounds like Opus 4 is different in that regard...?
I found that section of the announcement very suspicious because it omitted any statement about actual performance, and I guess now we know why.
This seems in line with my longer timelines hypothesis. Perhaps it roughly undoes the update on AlphaEvolve, which I wasn’t sure how to interpret.
Of course the METR evaluation will contain more signal.
(Source for the single-memory-file setup described above: this doc David Hershey just published on the new harness for the stream.)