I have not tested if Gemini can distinguish this tree (and intend to eventually). This may very well be the only reason Gemini has progressed further.
You missed an important fact about the Gemini stream: its harness reads the presence of these trees directly from RAM and labels them for the model (along with a few other special tiles like ledges and water). Nevertheless, I do think Gemini’s vision is better, by which I mean that if you provide it a screenshot it will sometimes identify the correct tree, whereas Claude never does. (Although to my knowledge the Gemini in the stream has literally never used vision for anything.) And in general the Gemini streamer is far more liberal about updating the scaffolding to address challenges than the Claude streamer is.
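For a rough idea of what that kind of scaffolding looks like, here’s a minimal sketch of reading the on-screen tile map from emulator memory and labeling the special tiles for the model. Everything here is a placeholder of my own invention (the tile IDs, the base address, the `emulator.read_byte` call); the actual addresses and labels the stream’s harness uses aren’t something I know.

```python
# Minimal sketch of RAM-based tile labeling, as I understand the idea.
# All addresses, tile IDs, and labels are hypothetical placeholders,
# NOT the actual values used by the Gemini stream's harness.

SPECIAL_TILES = {
    0x3D: "CUT_TREE",  # hypothetical tile ID for a cuttable tree
    0x36: "LEDGE",     # hypothetical tile ID for a jumpable ledge
    0x14: "WATER",     # hypothetical tile ID for surfable water
}

def read_visible_tiles(emulator, base_addr=0xC3A0, width=20, height=18):
    """Read the on-screen tile IDs from emulator RAM (made-up layout)."""
    tiles = []
    for row in range(height):
        offset = base_addr + row * width
        tiles.append([emulator.read_byte(offset + col) for col in range(width)])
    return tiles

def label_special_tiles(tiles):
    """Return (row, col, label) for every tile the model should be told about."""
    labels = []
    for r, row in enumerate(tiles):
        for c, tile_id in enumerate(row):
            if tile_id in SPECIAL_TILES:
                labels.append((r, c, SPECIAL_TILES[tile_id]))
    return labels

# The harness would then fold these labels into the model's context, e.g.
# "CUT_TREE at (6, 11); WATER at (12, 0) through (17, 4)", so the model
# never has to identify the tree visually at all.
```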
There’s also one other reason Gemini has gotten farther: it simply has the whole walkthrough of the game memorized, while Claude doesn’t know what to do after the Thunder Badge. (I don’t think either model would be remotely competent on RPGs that aren’t in the training data.)
This doesn’t mean memory is not a problem. The problems are just more subtle than one might imagine. For instance, the lack of direct memory means models have no real sense of time, or of how hard a task is. So even when given a notepad to record observations, they will not consistently record “HOW TO SOLVE THAT PUZZLE THAT TOOK FOREVER” because they don’t realize it took forever. And of course anything that isn’t written down falls completely out of “long-term” memory.
This has been a recurring problem with the Claude stream, where the model is given the ability to take notes. Whenever he’s struggling and failing to solve a problem for a long time, he’ll endlessly write notes about his (wrong) ideas for what to do, reinforcing that behaviour. When he finally tries the right thing, it seems like it was easy, so you MIGHT get one note written down about it. If you’re lucky.
In general, however incompetent this post makes the models sound at playing the game, they’re even worse than that. I feel like this is in large part because LLMs have frozen weights: every single mistake they make will be repeated every time the situation recurs, instead of just once as it would be for a human. Taking notes doesn’t help much, because their wrong basic instincts seem to matter far more than anything in their notes.