I’m sure you already know that your method of asking Opus 4.5 for time estimates is the weakest part of your methodology. What really throws me for a loop about it is that Claude models are consistently on the Pareto frontier of performance, while Anthropic has never seemed the strongest in math.
I would be interested to see how the data changes when you use different models for time estimates, or even average their estimates.
I don’t have a causal reason to expect it, but it would be fascinating if there is a continued trend of the time estimator’s model family dominating the Pareto frontier (e.g., ChatGPT models form the Pareto frontier when ChatGPT is the time estimator).
I would also be curious to see where Kimi K2 stands, since it is among the strongest non-reasoning models and reportedly has a different “feel” than other models (possibly due to the Muon optimizer).