The headline result was obviously going to happen, not an update for anyone paying attention.
The other claims are interesting. The GPT-5 release will be a valuable data point and will let us evaluate the claim that this reasoning training was not task-specific.
I don’t know if LLMs will become autonomous math researchers, but it seems likely to happen before other kinds of agency, since math research has the best feedback loops and is perhaps the domain best suited to text-based reasoning. Might mean that I’m out of a job.
“Obviously going to happen” is very different from “happens at this point in time rather than later or sooner, and with this particular announcement by this particular company”. You should still update off this. Hell, I was pretty confident this would be done first by Google DeepMind, so it’s a large update for me (though I don’t yet know in what direction)!
Your claim that this is “not an update for anyone paying attention” also seems false. I’m sure there are many people who were paying attention and are updating off this, for whatever reason, as they likely should.
I generally dislike this turn of phrase as it serves literally no purpose but to denigrate people who are changing their mind in light of evidence, which is just a bad thing to do.
I don’t have the link, but DeepMind researchers on X seem to have tacitly confirmed that they had already reached gold. What we don’t know is whether it was done with a general LLM, like OpenAI’s, or with a narrower system.
I think it was reasonable to expect GDM to achieve gold with an AlphaProof-like system. Achieving gold with a general LLM-reasoning system from GDM would be something else, and it is important for discussion around this not to confuse one forecast with another. (Not saying you are, but in general it is hard to tell which claim people are putting forward.)
Official results are in—Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress—huge congrats to @lmthang and the team! deepmind.google/discover/blo…
We achieved this year’s impressive result using an advanced version of Gemini Deep Think (an enhanced reasoning mode for complex problems). Our model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit! We’ll be making a version of this Deep Think model available to a set of trusted testers, including mathematicians, before rolling it out to Google AI Ultra subscribers.
Btw as an aside, we didn’t announce on Friday because we respected the IMO Board’s original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved
We’ve now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
It seems your forecast here was wrong.
I don’t believe anyone was forecasting this result, no.
EDIT: Clarifying—many forecasts made no distinction between an AI model with a major formal-methods component, like AlphaProof, and one without. I’m drawing attention to the fact that the two situations are distinct and require distinct updates. What those updates are, I’m not sure yet.
Oh I see, yeah that makes sense.
Well, fair enough, but I did specify that the surrounding context was an update.
You said “The other claims are interesting”, which maybe could include “this particular announcement”, but not “at this point in time rather than later or sooner” or “by this particular company”. I also object on the grounds that the “headline result” is not “not an update for anyone paying attention”. As evidence, see this Manifold market, which before the release of this model was at around 40%.
So the market was previously around 85%, and then it went down as we got further through the year. I guess this shows that many people didn’t expect it to happen in the next few months. The question wasn’t really load-bearing for my models, and you’re right that I’m not particularly interested in the fact that it happened at this point in time, or by this particular company.
Doesn’t seem like it’ll be very informative about this, given the OP’s: “Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.”
I’m not going to take that too seriously unless GPT-5 performs well.
It means you’d better figure out how to ask the right questions of a math-oracle LLM to finish your job quickly and be sure you did it right.
I thought about this sort of thing (adversarially robust augmentation) and decided it would be very hard to do safely with something smarter than you.
However, there may in fact be a window where LLMs are good at math but not at agency, during which they can be used to massively accelerate agent foundations research.
Agent foundations research is what I’m talking about, yup. What do you ask the AI in order to make significant progress on agent foundations, and how can you be sure you did so correctly? Are there questions we can ask where, even if we don’t know the entire theorem we want a proof of, we can show there aren’t many ways to fill in the whole theorem that could be of interest? Then we could, e.g., ask an AI to enumerate which theorems could have a combination of agency-relevant properties; a minimal sketch of that enumeration idea follows below. I’ve been procrastinating on writing a whole post pitching this because I myself am not sure the idea has merit, but maybe there’s something to be done here, and if there is, it seems like it could be a huge deal. It might also be possible to ask for significantly more complicated math to be solved, say by framing the request as a search for plausible compressions, simplifications, or generalizations of an expression.
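To make that enumeration idea concrete, here is a minimal sketch in Python. The property names are entirely hypothetical placeholders (nothing in the thread proposes this particular list); the point is only that the space of candidate theorem shapes can be made small enough to audit by hand, so the oracle’s answers get checked against a complete list rather than trusted blindly.

```python
from itertools import combinations

# Hypothetical agency-relevant properties, used purely as placeholders.
PROPERTIES = [
    "coherence of preferences over time",
    "robustness to self-modification",
    "stability under reflection",
    "goal preservation under amplification",
]

def candidate_templates(properties, n_premises=2):
    """Enumerate every 'if an agent satisfies X and Y, then Z' shape.

    Because the enumeration is exhaustive over a small space, a human can
    inspect the full list and decide which fill-ins are worth handing to a
    math-oracle LLM as conjectures to prove or refute.
    """
    for premises in combinations(properties, n_premises):
        for conclusion in properties:
            if conclusion not in premises:
                yield (
                    f"If an agent satisfies {' and '.join(premises)}, "
                    f"then it satisfies {conclusion}."
                )

if __name__ == "__main__":
    templates = list(candidate_templates(PROPERTIES))
    print(f"{len(templates)} candidate theorem shapes to triage:")
    for t in templates:
        print(" -", t)
```

With four placeholder properties and two premises this yields only twelve candidate shapes, which is the kind of exhaustively checkable space the comment above is gesturing at.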
I agree with this comment but am kinda surprised you are the one saying it. I realize this isn’t that strong an argument for “LLMs are actually good”, but its happening about now, as opposed to, like, six months later, seems like more evidence for them eventually being able to reliably do novel intellectual work.
By the way, I actually bet that the IMO gold would fall in 2025 and made a small (but very high percentage!) profit: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat#
#537 among “top traders,” made 106 mana.
So, at least for me, this was apparently priced in—worth mentioning since your comment seems to imply I should not have expected this a priori.
(To be fair, it must have been a pretty good deal when I bought, something like 20%)
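As a rough sanity check on “small (but very high percentage!)”, here is a sketch assuming a simple fixed-price fill; Manifold actually uses an automated market maker, so a real purchase moves the price as you buy:

```python
# Approximate numbers stated above; the fill price is treated as fixed.
price = 0.20   # rough YES price at purchase
profit = 106   # reported profit in mana

# A YES share bought at price p pays 1 mana on a YES resolution,
# so the profit per mana staked is (1 - p) / p.
stake = profit * price / (1 - price)
print(f"implied stake: ~{stake:.1f} mana")              # ~26.5 mana
print(f"return on stake: {(1 - price) / price:.0%}")    # 400%
```

So under that simplification, “very high percentage” cashes out to roughly a 4x return on a stake of around 26 mana.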
Well that is a cool thing to have on record. I believe you. :)
At the time did you hold mostly the same “it’s going to hit some kind of creativity / innovation wall eventually” beliefs? (or, however you’d summarize your take, I’m not 100% clear on it)
It’s the type of problem I expected LLMs to be able to solve—challenging proofs with routine techniques, probably no novel math concepts invented.
If it’s in the envelope of the achievable with current techniques, the time to get there seems more a function of prioritization and developer skill, and not evidence about the limits of the paradigm.
I guess it is a small update, though—these longer proofs may require some agency.