Great post. I’ve been following ClaudePlaysPokemon for some time, and it’s great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model built the scaffolding itself and had the option to review its performance and try to correct it. Give it the required game files and emulators, an IDE, and an OS, then watch it try to work around its own limitations. I think it is true that this is more about one coder’s ability to make agent harnesses.
P.S. Honest question: did I miss “agent harness” becoming the default name for such systems? I thought everyone called those “scaffoldings”, though it might be just me.
I would say “agent harness” is a type of “scaffolding”. I used it in this case because it’s how Logan Kilpatrick described it in the tweet I linked at the beginning of the post.
Thanks! That makes perfect sense.
(Gemini did actually write much of the Gemini_Plays_Pokemon scaffolding, but only in the sense of doing what David told it to do, not designing and testing it.)
I think you’re probably right that an LLM coding its own scaffolding is more achievable than one playing the game like a human, but I don’t think current models can do it. Watching the streams, the models don’t seem to understand their own flaws, although admittedly they haven’t been prompted to focus on this.
It’s perfectly fine that they can’t do it right now. It’s still worth setting this up to see exactly when they start being able to.