Personally, when I want to get a sense of capability improvements in the future, I’m going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
I was going to say exactly that lol. Claude has improved substantially on Claude Plays Pokemon:
But you have to be careful here, since the results heavily depend on details of the harness, as well as on how thoroughly the model has memorized walkthroughs of the game.
I think the “number of actions” axis is key here.
This post explains it well: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon. I’ve been watching Claude Plays Pokémon and chatting with the Twitch folks. The post matches my experience.
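For anyone unsure what "the harness" covers here: the sketch below is a minimal, purely hypothetical version of the kind of loop these setups run. The emulator interface, call_model function, and milestone names are placeholders I made up, not Anthropic's actual tooling.

```python
# Hypothetical sketch of a minimal "Claude Plays Pokemon"-style harness.
# `emulator` and `call_model` stand in for whatever emulator bindings and
# model API a real setup uses; they are placeholders, not a real library.

VALID_BUTTONS = {"up", "down", "left", "right", "a", "b", "start", "select"}

def run_episode(emulator, call_model, milestone="hall_of_fame", max_actions=100_000):
    """Screenshot -> model -> button-press loop; returns the number of actions used."""
    notes = []                                # running memory shown back to the model
    for step in range(max_actions):
        frame = emulator.screenshot()         # current game frame
        reply = call_model(frame, notes)      # model picks a button and writes notes
        button = reply.get("button")
        if button not in VALID_BUTTONS:
            continue                          # harness decision: silently skip invalid actions
        emulator.press(button)
        notes.append(reply.get("notes", ""))
        if emulator.reached(milestone):
            return step + 1                   # actions needed to hit the milestone
    return max_actions                        # ran out of budget
```

Almost every choice in a loop like this (how much of the notes history is fed back, how the screenshot is rendered or annotated, whether invalid actions are retried or skipped) shifts that "number of actions" curve, which is why runs under different harnesses aren't directly comparable.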
There’s plenty of room for improvement in prompt design and tooling, but Claude is still far behind the performance of my 7-year-old (unfair comparison, I know). So I agree with OP, this is an excellent benchmark to watch:
It’s not saturated yet.
It tests core capabilities for agentic behavior. If an AI can’t beat Pokémon, it can’t replace a programmer.
It gives a clear qualitative feel for competence—just watch five minutes.
It’s non-specialized and anyone can evaluate it and have a shared understanding (unlike Cursor, which requires coding experience).
And once LLMs beat Pokemon Red, I’ll personally want to see them beat other games as well to make sure the agentic capabilities are generalizing.
If we rely specifically on Pokémon, isn’t there a risk of models (either incidentally or intentionally) being overtrained on Pokémon-related data and getting a performance boost that way?
Branching out to other games sooner rather than later seems sensible.
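One cheap way to check that would be to run the same loop over several games and compare actions-to-milestone. Again just a sketch under the same assumptions as above: the game list and milestone names are illustrative, and run_episode is the hypothetical loop from the earlier sketch.

```python
# Hypothetical sketch: the same agent loop run across several games, to see
# whether capability generalizes or is Pokemon-specific. Game names and
# milestones are illustrative; `run_episode` is the loop sketched earlier.

GAMES = {
    "pokemon_red": "hall_of_fame",
    "links_awakening": "final_boss",
    "super_mario_land": "final_boss",
}

def generalization_report(make_emulator, call_model):
    """Actions-to-milestone per game; a Pokemon-only outlier suggests overfitting."""
    return {
        game: run_episode(make_emulator(game), call_model, milestone=milestone)
        for game, milestone in GAMES.items()
    }
```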