(Credentials: was an IMO team reserve, did some similar competitions)
Have the actual answers AI produced been posted? Because I could see this mattering a lot, or not at all, depending on the exact quality of the answers.
If you give a clean, accurate answer that lines up with the expected proof, grading is quite quick and very easy. But if your proof is messy and non-standard, coordinators need to go through it and determine its validity; and if you missed out part of the proof, there needs to be a standardized answer to 'how big a gap is this, and how much partial credit do you get?'
(Also, have the exact prompts used been posted? Because it would be very, very easy to add small amounts of text or examples that make these problems much easier. If the prompt used for Q4 contains the number '6' at any point in it, for example, I would basically just instantly call that 'cheating'.)
The answers, at least for OpenAI and Google, have been posted.
OpenAI: https://github.com/aw31/openai-imo-2025-proofs/
Google: https://storage.googleapis.com/deepmind-media/gemini/IMO_2025.pdf
The OpenAI solutions are written in a weird style (I have seen some people say "RLHFed to death", some say "looks like internal chain-of-thought language", and some say "it's learned to be brief to get maximum use out of its context window"), but they are still basically natural mathematical language.
The Google solutions—I don’t know whether this is exactly what the model spat out or whether it’s been through any prettifying postprocessing, but I would guess the former—are neatly formatted and written in nice polished English with LaTeX-formatted mathematics.
The ones I’ve looked at (which is not all of them) seem to be genuinely complete solutions, aside from the weird language of the OpenAI ones.
So far as I know, the prompts have not been posted.
(Some other people used ordinary Gemini 2.5 Pro to solve some of the problems—I forget whether they managed all of them other than #6 or not—with a bit more human assistance; I don’t think they posted their exact prompts but they did mention things like saying “Let’s proceed by induction” after the statement of Q1 and “Let’s do this with analytic geometry” after the statement of Q2. For the solutions above, I think OpenAI and Google both claim not to have given any sort of hints.)
Here’s the paper (with the prompts) for Gemini 2.5 Pro solving the first 5 questions with the hints. https://arxiv.org/abs/2507.15855
I looked at the Q1/4/5 answers[1]. I think they would indeed most likely all get 7s: there’s quite a bit of verbosity, and in particular OpenAI’s Q4 answer spends a lot of time talking its way around in circles, but I believe there’s a valid proof in all of them.
Most interesting is Q1, where OpenAI produces what I think is a very human answer (the same approach I took, and the one I’d expect most human solvers to take) while Google takes a less intuitive approach but one that ends up much neater. This makes me a little bit suspicious about whether some functionally identical problem showed up somewhere in Google’s training, but if it didn’t, that is extra impressive.
IMO Q3 and Q6 are generally much harder: the AI didn’t solve Q6, and I haven’t gone through the Q3 answers. Q2 was a geometry one, which is weirder to look through and which I find very unpleasant.
I’m not convinced that the Google approach to Q1 actually ends up significantly neater than the more obvious one (which was mine too, and like you I’d expect most human solvers to do it that way), but I do think the lemma they prove is a very nice one!
I think the obvious approach is comparably neat until you get to the point of proving that k=2 won’t work, at which point it’s a mess. The Google approach manages to prove that part in a much nicer way as a side effect of its general result.
The obvious approach lets you reduce to the case of a size-3 triangle, at which point there are WLOG only three options for your single non-sunny line, and you can pretty much draw the diagrams and say “behold!”, but even writing out longhand why you can’t cover what’s left in any of those cases with two sunny lines isn’t too painful.
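For concreteness, here is the case check I have in mind for the size-3 triangle, written out as a minimal LaTeX sketch. This is my own reconstruction of the "obvious" argument rather than a transcription of either model's write-up, so the details (the labelled points, the symmetry map, the case split) are my assumptions about what the diagrams would show.

% A minimal LaTeX sketch of the case check for the size-3 triangle.
% My reconstruction of the "obvious" argument, not either model's write-up.
\documentclass{article}
\begin{document}

\paragraph{Claim.} For $n=3$, no cover of
$P=\{(1,1),(1,2),(1,3),(2,1),(2,2),(3,1)\}$
by three distinct lines uses exactly two sunny lines.

\paragraph{Sketch.}
The only lines meeting $P$ in three points are $x=1$, $y=1$ and $x+y=4$, all
non-sunny, so a sunny line meets $P$ in at most two points. The affine map
$(a,b)\mapsto(b,\,5-a-b)$ permutes $P$ and cycles the three non-sunny
directions, so WLOG the single non-sunny line is horizontal.
\begin{itemize}
  \item If it is not $y=1$ or $y=2$, it covers at most one point of $P$,
        leaving at least five points for two sunny lines, which cover at most
        four.
  \item If it is $y=2$, the points $(1,1),(2,1),(3,1),(1,3)$ remain; the line
        through any two of the bottom three is $y=1$ (non-sunny), so each
        sunny line covers at most one of those three, and two sunny lines
        miss at least one.
  \item If it is $y=1$, the points $(1,2),(2,2),(1,3)$ remain; the line
        through any two of them is $y=2$, $x=1$ or $x+y=4$, all non-sunny, so
        each sunny line covers at most one of the three.
\end{itemize}

\end{document}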