I’d be very interested in talking to anonymous friend, or anyone else working on this. I have two relevant projects.
Most directly, I wrote a harness for LLMs to play text adventures and have spent some time optimizing the wrapper and testing on Anchorhead. As you’d expect, it runs into the same issues, just cheaper and without the vision problems.
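Roughly, the core loop looks like this (a minimal sketch, not my actual harness: it assumes dfrotz, the dumb-terminal build of Frotz, is running the story file, that the game prints a ">" prompt, and that call_llm stands in for whatever model API you're using):

```python
import pexpect  # drives dfrotz as a subprocess and waits on its prompt


def call_llm(transcript: str) -> str:
    """Hypothetical placeholder: send the transcript to a model, get one command back."""
    raise NotImplementedError


def play(story_file: str, max_turns: int = 100) -> str:
    # dfrotz is the "dumb" terminal interface to Frotz, so all output is plain text
    game = pexpect.spawn(f"dfrotz {story_file}", encoding="utf-8", timeout=30)
    transcript = ""
    for _ in range(max_turns):
        game.expect(">")                  # wait for the game's command prompt
        transcript += game.before + ">"   # room text, responses, score line, etc.
        command = call_llm(transcript).strip()
        transcript += " " + command + "\n"
        game.sendline(command)
    game.close()
    return transcript
```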
I’ve also worked on LLM social-deduction gameplay, which is nuanced and challenging in different ways, but shares the need for strong memory and robust reasoning in the face of hallucination.
I’d be happy to talk about any of these issues and compare leads!
I’ll let you know. They’re working on open-sourcing their scaffold at the moment.
Text adventures do seem like a good eval right now, since they’re the ONLY games that can be tested without either relying on vision (which is still very bad) or writing a custom harness for each game (in which case your results depend heavily on the harness).