Because Gemini 3 is not SOTA on SWE-bench Verified, I guess that its METR task length will not continue the exponential growth trend over GPT-5, and it's even odds whether it achieves SOTA at all.
Simple evidence to the contrary: Sonnet 4.5 is SOTA on SWE-bench yet lags notably behind GPT-5 on METR task length (and the difference in SWE-bench scores is greater here than the difference between Gemini 3.0 Pro and Sonnet).
Yes, they’re not consistently highly correlated, so my guess could be wrong.