Gemini 2.5 Pro (05-06) version just beat Pokémon Blue for the second time, taking 36,801 actions/406 hours. This is a significant improvement over the previous run, which used an earlier version of Gemini (and a less-developed scaffold) and took ~106,500 actions/816 hours.
For comparison, Claude 3.7 Sonnet took ~35,000 actions to get just 3 badges, and the aborted public run also only got 3 badges after over 200,000 actions. However, most of the difference is that Gemini is using a more advanced scaffold.
Gemini still generally makes boneheaded decisions. It took 7 tries to beat the Elite Four and Champion this time, being less overlevelled. Mistakes included:
Re-solving Victory Road for many hours after failures, despite being able to fly straight to the Elite Four after making it there the first time.
Not buying proper healing items/thinking the “Full Heal” item actually heals its Pokémon. (This is an infamously misleadingly-named item, it just cures status conditions. Still, Gemini knows the real effect if you ask it directly, it just doesn’t apply that knowledge consistently.)
Not using available revive items to bring back its only strong Pokémon, even after having just used them previously. (threw an attempt where the Champion was down to their final Pokémon once like that)
Anyway, not too much new info here. The newer models (latest Gemini, o3, and Claude 4 Opus) are only somewhat better at Pokémon. The effect scaffolds have on performance does say something about how much low-hanging fruit there is for current-LLM implementation: they can be a lot more effective when given the right tools/prompting. But, we already knew that.
Gemini 2.5 Pro (05-06) version just beat Pokémon Blue for the second time, taking 36,801 actions/406 hours. This is a significant improvement over the previous run, which used an earlier version of Gemini (and a less-developed scaffold) and took ~106,500 actions/816 hours.
For comparison, Claude 3.7 Sonnet took ~35,000 actions to get just 3 badges, and the aborted public run also only got 3 badges after over 200,000 actions. However, most of the difference is that Gemini is using a more advanced scaffold.
Gemini still generally makes boneheaded decisions. It took 7 tries to beat the Elite Four and Champion this time, being less overlevelled. Mistakes included:
Re-solving Victory Road for many hours after failures, despite being able to fly straight to the Elite Four after making it there the first time.
Not buying proper healing items/thinking the “Full Heal” item actually heals its Pokémon. (This is an infamously misleadingly-named item, it just cures status conditions. Still, Gemini knows the real effect if you ask it directly, it just doesn’t apply that knowledge consistently.)
Not using available revive items to bring back its only strong Pokémon, even after having just used them previously. (threw an attempt where the Champion was down to their final Pokémon once like that)
Anyway, not too much new info here. The newer models (latest Gemini, o3, and Claude 4 Opus) are only somewhat better at Pokémon. The effect scaffolds have on performance does say something about how much low-hanging fruit there is for current-LLM implementation: they can be a lot more effective when given the right tools/prompting. But, we already knew that.