I’m sure you already know that your method of asking Opus 4.5 for time estimates is the weakest part of your methodology. What really throws me for a loop about it is that Claude models are consistently on the Pareto frontier of performance, while Anthropic has never seemed the strongest in math.
I would be interested to see how the data changes when you use different models for time estimates, or even average their estimates.
I don’t have a causal reason to expect it, but it would be fascinating if there is a continued trend of the time estimator’s model family dominating the Pareto frontier (e.g., ChatGPT models form the Pareto frontier when ChatGPT is the time estimator).
I would also be curious to see where Kimi K2 stands, since it is among the strongest non-reasoning models and reportedly has a different “feel” than other models (possibly due to the Muon optimizer).