Actually, another group released VideoGameBench just a few days ago, and it includes Pokémon Red among other games. It's just a basic scaffold for Red, but that's fair.
Why hasn’t anyone run this as a rigorous benchmark? Probably because it takes multiple weeks to run a single attempt, and a lot of progress comes down to what is effectively “model RNG”. For example, Gemini recently failed the Safari Zone, a difficult challenge, because its inventory happened to be full and it couldn’t accept an item it needed; and Claude has taken wildly different amounts of time to exit Mt. Moon across attempts, depending on how he happens to wander. To run the benchmark rigorously, you’d need a sample of at least 10 full playthroughs per model, which would take perhaps a full year, by which point there’d be new models.
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than what's described in this post). When LLMs can make quick, cheap progress through Pokémon Red using that (not taking weeks and tens of thousands of steps), we’ll know real progress has been made.
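For concreteness, a “basic scaffold” in this sense is little more than a screenshot-in, button-press-out loop. Here is a minimal sketch with both the emulator and the model stubbed out; names like `get_screenshot`, `press_button`, and `query_model` are placeholders for illustration, not VideoGameBench’s actual API.

```python
# Minimal sketch of a basic game-playing scaffold: read the screen, ask the
# model for a button, press it, repeat. All interfaces here are hypothetical
# stubs; a real harness would wrap an actual Game Boy emulator and an LLM call.

VALID_BUTTONS = {"a", "b", "up", "down", "left", "right", "start", "select"}

def query_model(screenshot, history):
    """Stub model. A real scaffold would send the screenshot (and perhaps
    recent history) to an LLM and parse a button name out of its reply."""
    return "a"

def play(steps, get_screenshot, press_button):
    """Run the scaffold loop for a fixed number of steps."""
    history = []
    for _ in range(steps):
        shot = get_screenshot()
        action = query_model(shot, history)
        if action not in VALID_BUTTONS:
            action = "b"  # fall back to a harmless input on an unparseable reply
        press_button(action)
        history.append(action)
    return history

# Dummy emulator hooks so the sketch runs standalone.
pressed = []
history = play(5, get_screenshot=lambda: b"", press_button=pressed.append)
```

The point of keeping the scaffold this thin is that any real progress has to come from the model itself, not from hand-built navigation aids or memory systems layered on top.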
Sounds like you should create PokemonBench.
Actually, another group released VideoGameBench just a few days ago, and it includes Pokémon Red among other games. It's just a basic scaffold for Red, but that's fair.
As I wrote in my other post:
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than what's described in this post). When LLMs can make quick, cheap progress through Pokémon Red using that (not taking weeks and tens of thousands of steps), we’ll know real progress has been made.
And as long as they keep stumbling around like this, I will remain skeptical of AGI arriving in a few years.