GPT-5.1 beating Crystal in 108 hours is very interesting. I wonder why that’s the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?
Bunch of reasons:
The GPT-5.1 harness is stronger; in particular, it has better prompts (the value of iterated prompt-writing should not be underestimated here)
The two developers have different goals and approaches: the Gemini developer has trended towards letting the LLM make its own tools and play the game at its own speed, while the GPT developer pushes the LLM to play efficiently and beat the game quickly
GPT-5.1 is being run in “continuous thinking mode”, which in practice means it wastes less time and compute on simple tasks and thinks harder to get difficult problems right
Unfortunately, no one has done full playthrough comparisons on the same harness for all models, due to time and expense. (All three main developers for Claude/Gemini/GPT only have access to free tokens for their particular model brand.) Perhaps this will become possible sometime next year as completion time drops? (Cost per token might drop too, but perhaps not for frontier models.)