It’s interesting how much of this comes down to computer vision problems: the inability to look at the screen and determine which set of pixels is the stairs, or the complete inability to differentiate cuttable trees from ones that cannot be cut. That part at least seems like the kind of problem that would go away within a year if significant effort were devoted to it.
I find it fascinating how this set of children’s video games from the 90s does a better job of showing off my frustrations with large language models than anything else. When you give them small, concrete, narrow tasks and can reliably test their output, they are incredibly useful (e.g. they are superhuman at helping you write small functions in code), but do not try to get them to do a long-context task whose intermediate steps you can’t test. (After all, if you could test intermediate steps, you could break the task down until you reached the smallest intermediate step and prompt the model with just that step.) The hallucination problem is much clearer when playing Pokemon than anywhere else, and so are assorted issues with agentic ability.
The inability of models to retain memory is the major frustration currently preventing them from being used over longer contexts. Pokemon as a benchmark is in theory a 2-3 hour task from start to finish if you don’t waste any time (sub-2 is possible but takes a lot of resets).
I’ll say this much
Rainbolt-tier LLMs already exist: https://geobench.org/
AIs trained on GeoGuessr are dramatically better than Rainbolt and have been for years.