Simple evidence to the contrary: Sonnet 4.5 is SOTA on SWE-bench yet lags notably behind GPT-5 on METR task length (and the difference in SWE-bench scores is greater here than the difference between 3.0 Pro and Sonnet).
Yes, they’re not consistently highly correlated; my guess could be wrong.