The SWE-bench scores are already well below the trend from AI 2027. The trend called for 85% by the end of the month; we’re at 75%. (And SOTA was ~64% when they released AI 2027.)
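For concreteness, a quick back-of-envelope using only the figures above (the linear framing is my simplification, not AI 2027’s actual forecast curve):

```python
# Illustrative only: the trend implied roughly 64% -> 85% over this
# window, and we actually got roughly 64% -> 75%.
predicted_gain = 85 - 64   # percentage points the trend called for
actual_gain = 75 - 64      # percentage points actually realized
print(f"{actual_gain / predicted_gain:.0%} of the predicted gain")  # ~52%
```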
Gemini 3 should drop by the end of the month; we might hit that.
+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5’s METR task length?
I suppose it’s a possibility, albeit a remote one.
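For scale, a minimal sketch of what quadrupling would take, assuming METR’s reported ~7-month doubling time for task horizons (the doubling time is an assumption here; recent models may be on a faster curve):

```python
import math

# Sanity check on the "quadrupling" claim. Assumption: task horizons
# double roughly every 7 months, per METR's published trend estimate.
DOUBLING_TIME_MONTHS = 7

def months_to_multiply(factor: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Months of trend progress needed to grow the task horizon by `factor`."""
    return doubling_time * math.log2(factor)

print(months_to_multiply(4))  # 4x = 2 doublings -> 14.0 months
```

Two doublings in a single release would mean compressing roughly 14 months of trend into one model generation.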
It seems Gemini was ahead of OpenAI on the IMO gold. Its output was more polished, so presumably they achieved a gold-worthy model earlier. I thus expect Gemini’s SWE-bench score to be at least ahead of OpenAI’s 75%.
I don’t believe there’s a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding tasks, where a stronger correlation exists).
Gemini 2.5 Pro was already well ahead of o3 on the IMO, but had worse SWE-bench/METR scores.
Claude is relatively bad at math but has hovered around SOTA on agentic coding.