It seems Gemini was ahead of OpenAI on the IMO gold. The output was more polished, so presumably they achieved a gold-worthy model earlier. I therefore expect Gemini's SWE-bench score to be at least ahead of OpenAI's 75%.
I don’t believe there’s a strong correlation between mathematical ability and performance on agentic coding tasks (as opposed to competition coding, where the correlation is stronger).
Gemini 2.5 Pro was already well ahead of O3 on the IMO, but had worse SWE-bench/METR scores.
Claude is relatively bad at math but has hovered around SOTA on agentic coding.