Much of the gains on SWE Bench are actually about having the model find better context via tool calls. Sonnet 3.7 is trained to seek out the information it needs.
But if you compare the models with fixed context, they are only somewhat smarter than before.
(The other dimension is the thinking models which seem to be only a modest improvement on coding, but do much better at math for example.)
That being said, the new Gemini 2.5 Pro seems like another decent step up in intelligence from Sonnet 3.7. We’re about to switch out the default mode of our coding agent, Codebuff, to use it (and already shipped it for codebuff—max).
Much of the gains on SWE Bench are actually about having the model find better context via tool calls. Sonnet 3.7 is trained to seek out the information it needs.
But if you compare the models with fixed context, they are only somewhat smarter than before.
(The other dimension is the thinking models which seem to be only a modest improvement on coding, but do much better at math for example.)
That being said, the new Gemini 2.5 Pro seems like another decent step up in intelligence from Sonnet 3.7. We’re about to switch out the default mode of our coding agent, Codebuff, to use it (and already shipped it for codebuff—max).