Thanks for the overview!
So, speaking specifically about IMO Gold, OpenAI has not released a configuration capable of achieving IMO Gold yet, and it seems that the Gemini configuration capable of achieving IMO Gold is still only available to a select group of testers (including some mathematicians)[[1]].
So, unless I am mistaken, on the “informal track”, DeepSeek is not just the first IMO Gold capable system available as open weights, but the first IMO Gold capable system publicly available at all.
On the formal, Lean-oriented “track”, it might be that the publicly available version of Aristotle from Harmonic is good enough now (when its experimental version was initially made available in the summer, it did not seem very strong, but it should be much better now; see “Aristotle: IMO-level Automated Theorem Proving”, https://arxiv.org/abs/2510.01346).
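(For concreteness, the difference between the two tracks is that on the formal track a “solution” is a Lean proof that the proof checker accepts mechanically, rather than prose graded by humans. A toy illustration of what that means, a trivial statement and not an actual IMO problem:)

```lean
-- Toy illustration of the formal track: the "solution" is a Lean proof
-- that the kernel checks mechanically, instead of prose graded by humans.
-- (Trivial statement for illustration only, not an actual IMO problem.)
theorem double_eq_two_mul (n : Nat) : n + n = 2 * n := by
  omega
```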
https://blog.google/products/gemini/gemini-3-deep-think/, which came out yesterday, says: “Gemini 3 Deep Think is industry leading on rigorous benchmarks like Humanity’s Last Exam (41.0% without the use of tools) and ARC-AGI-2 (an unprecedented 45.1% with code execution). This is because it uses advanced parallel reasoning to explore multiple hypotheses simultaneously — building on Gemini 2.5 Deep Think variants that recently achieved a gold-medal standard at the International Mathematical Olympiad and at the International Collegiate Programming Contest World Finals.” It’s a bit ambiguous: they say they are using the same technique, but it’s not clear whether this publicly available configuration can achieve results that high.
Unfortunately, Gemini 3 Pro without the Deep Think option managed to one-shot Problems 1, 3, 4, and 5 of IMO 2025. I suspect that if we prompt the system to solve the problems one by one, we will end up with it solving everything except Problem 6.
EDIT: fortunately, it did not manage to solve all the problems. Unfortunately, prompting the model to solve them one by one still went quite well: it solved Problems 1, 3, and 5, failed Problem 2, and encountered an error on Problem 4.
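For concreteness, “prompting the model to solve them one by one” just means giving each problem its own independent request rather than all six in a single prompt. A minimal sketch of that setup via the google-genai Python SDK (the model identifier and prompt wording here are my placeholder assumptions, not the exact configuration discussed above):

```python
# Minimal sketch of "one problem per fresh prompt", assuming the google-genai
# Python SDK. Model identifier and prompt wording are placeholders.
from google import genai

client = genai.Client()  # picks up the API key from the environment

problems = {
    1: "IMO 2025, Problem 1: <full problem statement here>",
    # ... problems 2 through 6 ...
}

for number, statement in problems.items():
    # Each problem goes into its own independent request, so nothing carries
    # over from the previous problem's context.
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model identifier
        contents=(
            "Solve the following IMO problem with a complete, rigorous proof. "
            "Do not search the web or use any external tools.\n\n" + statement
        ),
    )
    print(f"=== Problem {number} ===\n{response.text}\n")
```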
That’s good (assuming no contamination, of course; I don’t expect it to break instructions not to search, but it could have seen the problems at some of its training phases).
But this will be possible to double-check in the future with novel problems.
(I assume someone checked the correctness of these versions of the solutions; this is just a conversation, but someone does need to actually verify the details.)