The ClaudePlaysPokemon Twitch stream continues to state that Claude hasn’t been trained to play Pokemon. That, plus the relatively minimal harness/scaffold, makes Pokemon an interesting benchmark for long-horizon agency generalization / out-of-distribution performance.
I’ve asked Claude to compile data for me and build some nice graphs. (Warning: For all I know some of this is hallucinated or based on terrible modelling assumptions. Do not trust.)
Here’s the graph of human vs. Claudes with the y-axis being step count (i.e. number of button presses, I think). [EDIT: Josh corrects this in the comments; the number of button presses is something like 2.5x the number of steps. Each step is an action taken by Claude. By contrast, the human ‘steps’ are backed out from a guess of about 78 button presses/min for speedrunners. So if we convert the graph to button presses, we should basically multiply all the AIs by 2.5x on this graph, and then probably lower the median human line by a significant factor as well, since the median human probably does less than 60 button presses per minute.]
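As a rough sanity check on that conversion, here’s the arithmetic spelled out (a sketch only; the 2.5 presses-per-step figure and the ~60 presses/min human rate are just the guesses above, and the 30-hour human run is hypothetical):

```python
# Rough conversion of the graph's units into button presses.
# Assumptions (guesses from the discussion above, not measured values):
#   - each Claude "step" averages ~2.5 button presses
#   - a median human presses ~60 buttons per minute (speedrunners ~78)

PRESSES_PER_CLAUDE_STEP = 2.5
HUMAN_PRESSES_PER_MIN = 60

def claude_steps_to_presses(steps: float) -> float:
    """Convert Claude's step count to an estimated button-press count."""
    return steps * PRESSES_PER_CLAUDE_STEP

def human_hours_to_presses(hours: float) -> float:
    """Estimate a human's total button presses from hours of play."""
    return hours * 60 * HUMAN_PRESSES_PER_MIN

# e.g. a 44,500-step Claude run vs. a hypothetical 30-hour human run
print(claude_steps_to_presses(44_500))  # ~111,250 presses
print(human_hours_to_presses(30))       # ~108,000 presses
```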
And here’s the graph of human vs. Claudes with the x-axis being time (notably different because at least on Twitch Claude takes many seconds to do each action/step, whereas humans are usually pressing buttons at a much faster rate)
Eyeballing it, it looks like the various versions of Claude have crossed about half the distance from where they were a year ago, to median-human-level play? (As measured in hours. Measured in steps, they are already better than median humans?)
Not sure if that makes sense. Maybe it’s hallucinated.
It’s too bad the BALROG benchmark isn’t being updated with the newest models. NetHack is really hard, gives a floating-point score, and is text-based, so if a model is vision-impaired (like the Claudes) there’s less contamination through “the model just can’t see where it is”.
> Here’s the graph of human vs. Claudes with the y-axis being step count (i.e. number of button presses, I think)
For this, Claude’s steps aren’t button presses; they are a round of asking Claude what to do next, but Claude can do a couple of button presses for each step if he decides to.
Thanks! Got a sense of how many button presses happen on average per step? That would be helpful for making an apples-to-apples comparison.
I’d guess 2.5; plenty of times it just does one button press and then waits to see the results, with the longer steps being where it navigates to a spot on the screen (very common) or scrolls up/down through a menu (uncommon).
If you/I got the logs from the dev, a firm average would be easy to calculate.
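If those logs were available, the calculation would be something like this (a sketch; the log format here is made up, since I don’t know what the dev’s logs actually look like):

```python
# Hypothetical log: each entry is one Claude "step" with the button
# presses it issued. The real log format is unknown.
steps_log = [
    {"action": "press", "buttons": ["a"]},
    {"action": "navigate", "buttons": ["up", "up", "left", "a"]},
    {"action": "press", "buttons": ["b"]},
]

total_presses = sum(len(step["buttons"]) for step in steps_log)
avg_presses_per_step = total_presses / len(steps_log)
print(f"{avg_presses_per_step:.2f} button presses per step")  # 2.00 for this toy log
```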
Update: New data has come in! (I got the milestones from Reddit and had Claude make these graphs, so be warned they might be wrong.)
I venture to guess that by the end of 2026, there will be an AI system accessible to members of the public like me, that can build its own scaffold for a typical turn-based video game and (within a month or so) beat said game. So e.g. by the end of the year I should be able to get Claude Code to beat Advance Wars 2 https://kbhgames.com/game/advance-wars-2, perhaps with a few hours of my own time spent going back and forth with it to iterate on scaffolds, and perhaps with a false start or two where it gets stuck partway through the game due to some issue with the memory management system or whatever.
(Come to think of it, this MIGHT be possible today for all I know. Maybe I should check one of these days...)
Log scale version:
A bit of a nitpick, but 78 steps/min for a human seems very fast; that’s more the speed I’d play an RTS at than a turn-based RPG. I guess that makes sense if that’s the speedrun speed, less so for a casual playthrough.
Yeah, I agree; the conclusion is that median human performance in the steps graph should be higher. It was always suspicious that it was worse than the models anyway.
Looking at the step count comparisons instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph Co., where it has been stuck for several days. So that should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.
I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night, while 2.5 only had 4/16 badges at 435 hours.
Did 3 Pro and 2.5 Pro have the same harness? I assume so… 2.5 was released at the end of March, 3 was released in late November, so an 8-month gap. So Pokemon ‘speed’ improved by 4x in 8 months. Meanwhile for Claude, 3.7 Sonnet was released in February, so 9 months to get to today’s Opus 4.5. It seems harder to make the comparison for Claude, because 3.7 Sonnet seems like it was basically stuck and would essentially never finish. If we focus on the part before it got stuck, though, e.g. defeating Surge, it seems like Opus 4.5 is a bit more than 4x faster? But then Opus 4.5 got stuck later in the Rocket Hideout… but eventually made it through… yeah, hard to compare. But it’s interesting that they both are doing something like a 4x improvement in speed over a similar period of time.
Yes, they both started with the same harness, but there’s room for each model to customize its own setup, so I’m not sure how much they might have diverged over time. I have 4x speedup as probably an upper bound, but I was only counting since the final 2.5 stable release in June, which might be too short. Gemini 2.5 is at 6 badges now, up from yesterday, so it’s probably too early to assume 4x is certain. But if it were 4x every 8 months, then it should be able to match average human playtime by early 2027.
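For what it’s worth, here’s that extrapolation spelled out (a sketch under stated assumptions: the ~424.5-hour Gemini 3 run, a 4x speedup every 8 months, and an average human playtime of roughly 30 hours, which is my own guess rather than a figure from this thread):

```python
import math

current_hours = 424.5     # Gemini 3 Pro's Pokemon Crystal completion time
speedup_per_period = 4.0  # assumed 4x improvement per period
period_months = 8
human_hours = 30.0        # assumed average human playtime (rough guess)

# Solve current_hours / speedup**(m / period) <= human_hours for m
months_needed = period_months * math.log(current_hours / human_hours, speedup_per_period)
print(f"~{months_needed:.0f} months")  # ~15 months after the Gemini 3 run, i.e. roughly early 2027
```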
From the Gemini_Plays_Pokemon Twitch page:
“v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it.”
“The AI has access to a set of built-in tools to interact with the game and its own internal state:
notepad_edit: Modifies an internal notepad, allowing the AI to write down strategies, discoveries, and long-term plans.
run_code: Executes Python code in a secure, sandboxed environment for complex calculations or logic that is difficult to perform with reasoning alone.
define_map_marker / delete_map_marker: Adds or removes markers on the internal map to remember important locations like defeated trainers, item locations, or puzzle elements.
stun_npc: Temporarily freezes an NPC in place, which is useful for interacting with them.
select_battle_option: Provides a structured way to choose actions during a battle, such as selecting a move or using an item.
Custom Tools & Agents
The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub.
Custom Tools (define_tool / delete_tool): If the AI identifies a repetitive or complex data-processing task, it can write and save its own Python scripts as new tools. For example, instead of relying on a pre-built pathfinder, it can write its own pathfinding tool from scratch to navigate complex areas like the spinner mazes in the Team Rocket Hideout.
Custom Agents (define_agent / delete_agent): For complex reasoning tasks, the AI can define new, specialized instances of itself without any distracting context. These agents are given a unique system prompt and purpose, allowing them to excel at specific challenges like solving puzzles or developing high-level battle strategies.”
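For a sense of what one of those self-written tools might look like, here’s a minimal sketch of the sort of pathfinding helper described above (my own illustration, not code from the actual run; the function name and grid format are made up):

```python
from collections import deque

def find_path(grid, start, goal):
    """BFS over a tile grid; '.' is walkable, '#' is blocked.
    Returns a list of (row, col) steps from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

# Toy map: the model could register something like this via define_tool
# and call it with the current map state instead of reasoning tile-by-tile.
grid = ["....#",
        ".##.#",
        "....."]
print(find_path(grid, (0, 0), (2, 4)))
```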
But no version of Claude has actually beaten the game, so it seems a bit strange to say they’ve crossed half the distance to median human play…
I’m not sure I should update on this, since it is from an AI.
Also, it sounds like Gemini has beaten Pokémon with a fixed harness now?? But we don’t know whether it was trained on Pokémon?
I agree we probably shouldn’t update much on this, it’s from AI and it’s janky.
As for beating the game… well sure, but based on the above graphs it seems like Claude will beat the game within about a year?
Other models have beaten the game months ago, but with more advanced harnesses/scaffolds.
I’m under the impression that the harness has been adjusted over time to fit Claude’s deficiencies: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-now-better-than-claude-at-pokemon
Therefore this benchmark is really benchmarking human+AI capability.
(Author of the post you linked here.) No, Claude’s harness has been pretty stable and minimal. It’s the other LLMs that have beaten the game with stronger/more optimized harnesses.
Thanks for the clarification!
They have a doc for the harness changes from model to model for this series of runs (claudeplayspokemon). Excerpt on Opus 4.5 changes:
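ClaudePlaysPokemon Opus 4.5 Harness Changes
Navigator
Added support for Surf 😉
Marked Spin tiles and Teleport tiles as not navigable so navigation sequences wouldn’t accidentally hit these, which was impossible for the model to realize
Made it so when the model hits a spin tile we wait for the player to stop moving before giving the model the next screenshot
Added support for side entrances to gates (previously were marked as non navigable)
Opus called out that it was confusing that tiles that were out of reach but walkable were marked as red, so we updated it to mark those as cyan instead. Possible a better prompt could have fixed this, but it was an easy change
Memory
We’re back to multi-file memory now – Claude is responsible and is able to manage multiple files without losing the plot.
Misc.
I removed a bunch of tooling that told Claude when things were going wrong (e.g. informing Claude it was stuck). Claude doesn’t need this anymore.
Also I let Claude enter names faster now 🚀
Hints
I removed all of the hints I used to give models (Claude is pretty good these days). I do have a few examples of mistakes that Claude makes visually that you could interpret as hints, ymmv: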
So, yes, it does seem hard to draw many conclusions from the performance differences, since we’re far from apples-to-apples. But at least we can see that the harness is not only accommodating the models’ deficiencies but, over time, also removing assists as new strengths emerge.