We didn’t dive into that too much because it’s covered in depth by previous posts (see footnote 1), but TL;DR: even for Claude, the harness has changed fairly frequently. There is a partial changelog linked at the beginning of the “Harness Changes for 4.7” section. Pokémon is not a rigorous benchmark; it’s too long and expensive to run. We don’t know for certain how earlier versions would do in this version of the harness.
That said, some recent changes to the harness have also weakened it: there used to be a “Critic Claude” (“CC”) that would regularly analyze the main Claude’s actions and try to keep it from getting stuck (referenced in the YouTube video); there used to be hints to help it understand the visuals better; the overall prompt used to give more general advice; memory files and button presses were previously limited to help prevent mistakes; etc.
Other info from the announcement worth mentioning:
It was a general model, not a specialized one; they were just testing it on random Erdős problems.
The key trick was seemingly applying algebraic number theory to geometry in an unexpected way.