Thanks for following through.
If anyone wants to make a proper harness in the future, I think probably the most interesting question here is if the LLM can learn from multiple playthroughs, unlocking harder difficulties, etc.
Modern LLMs, maybe through notetaking?
The models have always been deeply familiar with Pokémon and how to play through it from the initial tests with Sonnet 3.7—they all know Erika is the fourth gym leader in Red, there’s just too much internet text about this stuff. It contaminates the test, from a certain perspective, but it also makes failures and weaknesses even more apparent.
It is possible that Claude Opus 4.5 alone was trained on more Pokémon images as part of its general image training (more than Sonnet 4.5, though...?), but it wouldn’t really matter: pure memorization would not have helped previous models, because they couldn’t clearly see/understand stuff they definitely knew about. (I also doubt Anthropic is benchmaxxing Pokémon, considering they’ve kept their harness limited even after Google and OpenAI beat them on their own benchmark.)