o3 beat Pokémon Red today, making it the second model to do so after Gemini 2.5 Pro (technically Gemini beat Blue).
It used an advanced custom harness like Gemini’s, rather than a basic one like Claude’s. The runs are hard to compare because o3’s harness differs from Gemini’s, but Gemini’s most recent run finished in ~406 hours / ~37k actions, whereas o3 finished in ~388 hours / ~18k actions (there are some differences in how actions are counted). Claude Opus 4 has yet to earn the 4th badge on its current ~380 hour / 54k action run, but it very likely could beat the game with an advanced harness.
See here for stream
See here for info on harness
I really wish someone would try o3/Gemini with a weaker harness (say, one equal to Claude’s). That would be more interesting, and it would also make cross-model comparison easier.
The best same-harness cross-model comparison I know of is here.