Important caveat: To get human completion times, I ask Opus 4.5 (with thinking) to estimate how long it would take the median AIME participant to complete a given problem. These times seem roughly reasonable to me, but getting some actual human baselines and using these to correct Opus 4.5’s estimates would be better.
I’m sorry but what? That’s not just a caveat, that makes the rest of this analysis close to meaningless!
You can’t say you’re measuring LLM progress if the goalposts are also being placed by an LLM. For all you know, you’re just measuring how strongly LLMs are affected by some LLM-specific idiosyncrasy, with little to no relation to how hard it would be for a human to actually solve the problem.