I’d be very interested in talking to anonymous friend, or anyone else working on this. I have two relevant projects.
Most directly, I wrote a harness for LLMs to play text adventures and have spent some time optimizing the wrapper and testing on Anchorhead. As you’d expect, it runs into the same issues, just cheaper and without the vision problems.
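Roughly, the core loop looks like this (a minimal sketch, not my actual harness: it assumes dfrotz, the dumb-terminal build of Frotz, is running the story file, that the game prints a ">" prompt, and that call_llm stands in for whatever model API you're using):

```python
import pexpect  # drives dfrotz as a subprocess and waits on its prompt


def call_llm(transcript: str) -> str:
    """Hypothetical placeholder: send the transcript to a model, get one command back."""
    raise NotImplementedError


def play(story_file: str, max_turns: int = 100) -> str:
    # dfrotz is the "dumb" terminal interface to Frotz, so all output is plain text
    game = pexpect.spawn(f"dfrotz {story_file}", encoding="utf-8", timeout=30)
    transcript = ""
    for _ in range(max_turns):
        game.expect(">")                  # wait for the game's command prompt
        transcript += game.before + ">"   # room text, responses, score line, etc.
        command = call_llm(transcript).strip()
        transcript += " " + command + "\n"
        game.sendline(command)
    game.close()
    return transcript
```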
I’ve also worked on LLM social-deduction gameplay, which is nuanced and challenging in different ways, but shares the need for strong memory and robust reasoning in the face of hallucination.
I’d be happy to talk about any of these issues and compare leads!
I’ll let you know. They’re working on open-sourcing their scaffold at the moment.
Text adventures do seem like a good eval right now, since they’re the ONLY games that can be tested without either relying on vision (which is still very bad) or writing a custom harness for each game (in which case your results depend heavily on the harness).