Cole Wyeth comments on Cole Wyeth’s Shortform

Cole Wyeth 18 Nov 2025 23:11 UTC
5 points
−6
Because Gemini 3 is not SOTA on SWE-bench Verified, I guess that its METR task length will not continue the exponential growth trend over GPT-5, and I give even odds on whether it achieves SOTA at all.
- jamjam 19 Nov 2025 0:51 UTC
  2 points
  0
  Simple evidence to the contrary: Sonnet 4.5 is SOTA on SWE-bench yet lags notably behind GPT-5 on METR task length (and the gap in SWE-bench scores between those two is larger than the gap between Gemini 3.0 Pro and Sonnet).
  - Cole Wyeth 19 Nov 2025 2:30 UTC
    2 points
    0
    Yes, they’re not consistently highly correlated; my guess could be wrong.