Personally, when I want to get a sense of capability improvements in the future, I’m going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:
But you have to be careful here, since the results heavily depend on details of the harness, as well as on how thoroughly the model has memorized walkthroughs of the game.
I think the “number of actions” axis is key here.
This post explains it well: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon. I’ve been watching Claude Plays Pokémon and chatting with the Twitch folks. The post matches my experience.
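For anyone unsure what "the harness" covers here: the sketch below is a minimal, purely hypothetical version of the kind of loop these setups run. The emulator interface, call_model function, and milestone names are placeholders I made up, not Anthropic's actual tooling.

```python
# Hypothetical sketch of a minimal "Claude Plays Pokemon"-style harness.
# `emulator` and `call_model` stand in for whatever emulator bindings and
# model API a real setup uses; they are placeholders, not a real library.

VALID_BUTTONS = {"up", "down", "left", "right", "a", "b", "start", "select"}

def run_episode(emulator, call_model, milestone="hall_of_fame", max_actions=100_000):
    """Screenshot -> model -> button-press loop; returns the number of actions used."""
    notes = []                                # running memory shown back to the model
    for step in range(max_actions):
        frame = emulator.screenshot()         # current game frame
        reply = call_model(frame, notes)      # model picks a button and writes notes
        button = reply.get("button")
        if button not in VALID_BUTTONS:
            continue                          # harness decision: silently skip invalid actions
        emulator.press(button)
        notes.append(reply.get("notes", ""))
        if emulator.reached(milestone):
            return step + 1                   # actions needed to hit the milestone
    return max_actions                        # ran out of budget
```

Almost every choice in a loop like this (how much of the notes history is fed back, how the screenshot is rendered or annotated, whether invalid actions are retried or skipped) shifts that "number of actions" curve, which is why runs under different harnesses aren't directly comparable.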
There’s plenty of room for improvement in prompt design and tooling, but Claude is still far behind the performance of my 7-year-old (unfair comparison, I know). So I agree with OP, this is an excellent benchmark to watch:
It’s not saturated yet.
It tests core capabilities for agentic behavior. If an AI can’t beat Pokémon, it can’t replace a programmer.
It gives a clear qualitative feel for competence—just watch five minutes.
It’s non-specialized and anyone can evaluate it and have a shared understanding (unlike Cursor, which requires coding experience).
And once LLMs beat Pokemon Red, I’ll personally want to see them beat other games as well to make sure the agentic capabilities are generalizing.
If we rely specifically on Pokémon, isn’t there a risk of models (either incidentally or intentionally) being overtrained on Pokémon-related data and getting a performance boost that way?
Branching out to other games sooner rather than later seems sensible.
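One cheap way to check that would be to run the same loop over several games and compare actions-to-milestone. Again just a sketch under the same assumptions as above: the game list and milestone names are illustrative, and run_episode is the hypothetical loop from the earlier sketch.

```python
# Hypothetical sketch: the same agent loop run across several games, to see
# whether capability generalizes or is Pokemon-specific. Game names and
# milestones are illustrative; `run_episode` is the loop sketched earlier.

GAMES = {
    "pokemon_red": "hall_of_fame",
    "links_awakening": "final_boss",
    "super_mario_land": "final_boss",
}

def generalization_report(make_emulator, call_model):
    """Actions-to-milestone per game; a Pokemon-only outlier suggests overfitting."""
    return {
        game: run_episode(make_emulator(game), call_model, milestone=milestone)
        for game, milestone in GAMES.items()
    }
```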