Ozyrus comments on Is Gemini now better than Claude at Pokémon?

Ozyrus 20 Apr 2025 17:27 UTC
13 points
4
Great post. I’ve been following ClaudePlaysPokemon for some time; it’s great to see this grow into a comparison/capability tool.
I think it would be much more interesting, though, if the model made the scaffolding itself and had the option to review its performance and try to correct it. Give it the required game files/emulators and an IDE/OS, and watch it try to work around its own limitations. I think it is true that, as it stands, this is more about one coder’s ability to make agent harnesses.
P.S. Honest question: did I miss “agent harness” becoming the default name for such systems? I thought everyone called those “scaffoldings”; might be just me, though.
- Julian Bradshaw 20 Apr 2025 19:00 UTC
  5 points
  2
  Parent
  I would say “agent harness” is a type of “scaffolding”. I used it in this case because it’s how Logan Kilpatrick described it in the tweet I linked at the beginning of the post.
  - Ozyrus 21 Apr 2025 11:18 UTC
    3 points
    0
    Parent
    Thanks! That makes perfect sense.
- MrCheeze 21 Apr 2025 20:54 UTC
  2 points
  0
  Parent
  (Gemini did actually write much of the Gemini_Plays_Pokemon scaffolding, but only in the sense of doing what David told it to do, not designing and testing it.)
  I think you’re right that an LLM coding its own scaffolding is probably more achievable than one playing the game like a human, but I don’t think current models can do it. Watching the streams, the models don’t seem to understand their own flaws, although admittedly they haven’t been prompted to focus on this.
  - Ozyrus 22 Apr 2025 7:33 UTC
    1 point
    0
    Parent
    Not being able to do it right now is perfectly fine; it still warrants setting this up to see exactly when they will start to be able to do it.