Actually, another group released VideoGameBench just a few days ago, and it includes Pokémon Red among other games. It's just a basic scaffold for Red, but that's fair.
Why hasn’t anyone run this as a rigorous benchmark? Probably because it takes multiple weeks to run a single attempt, and a lot of progress comes down to what is effectively “model RNG”. For example, Gemini recently failed the Safari Zone, a difficult challenge, because its inventory happened to be full and it couldn’t accept an item it needed; and Claude has taken wildly different amounts of time to exit Mt. Moon across attempts, depending on how he happens to wander. To run the benchmark rigorously, you’d need a sample of at least 10 full playthroughs per model, which would take perhaps a full year, by which point there’d be new models.
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than what's described in this post). When LLMs can make quick, cheap progress through Pokémon Red using that (not taking weeks and tens of thousands of steps), we’ll know real progress has been made.
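For concreteness, a “basic scaffold” in this sense is little more than a screenshot-in, button-press-out loop. Here is a minimal sketch with both the emulator and the model stubbed out; names like `get_screenshot`, `press_button`, and `query_model` are placeholders for illustration, not VideoGameBench’s actual API.

```python
# Minimal sketch of a basic game-playing scaffold: read the screen, ask the
# model for a button, press it, repeat. All interfaces here are hypothetical
# stubs; a real harness would wrap an actual Game Boy emulator and an LLM call.

VALID_BUTTONS = {"a", "b", "up", "down", "left", "right", "start", "select"}

def query_model(screenshot, history):
    """Stub model. A real scaffold would send the screenshot (and perhaps
    recent history) to an LLM and parse a button name out of its reply."""
    return "a"

def play(steps, get_screenshot, press_button):
    """Run the scaffold loop for a fixed number of steps."""
    history = []
    for _ in range(steps):
        shot = get_screenshot()
        action = query_model(shot, history)
        if action not in VALID_BUTTONS:
            action = "b"  # fall back to a harmless input on an unparseable reply
        press_button(action)
        history.append(action)
    return history

# Dummy emulator hooks so the sketch runs standalone.
pressed = []
history = play(5, get_screenshot=lambda: b"", press_button=pressed.append)
```

The point of keeping the scaffold this thin is that any real progress has to come from the model itself, not from hand-built navigation aids or memory systems layered on top.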
Sounds like you should create PokemonBench.
Actually, another group released VideoGameBench just a few days ago, and it includes Pokémon Red among other games. It's just a basic scaffold for Red, but that's fair.
As I wrote in my other post:
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than what's described in this post). When LLMs can make quick, cheap progress through Pokémon Red using that (not taking weeks and tens of thousands of steps), we’ll know real progress has been made.
And as long as they keep stumbling around like this, I will remain skeptical of AGI arriving in a few years.