Dataset contamination is a big problem with these kinds of bets. The sorts of models I expect people to use to win an IMO will probably have seen those IMO questions before, or extremely similar ones. Also, I don’t buy that winning an IMO means AI “beat humans at math”. Mathematics research, on the level of proving a major theorem autonomously, requires quite different capabilities than winning an IMO.
Still, I’d guess there’s maybe a 30-35% chance of an AI, NOT trained on prior IMO/maths contest questions, winning an IMO by 2026.
From Metaculus’ resolution criteria: “This question resolves on the date an AI system competes well enough on an IMO test to earn the equivalent of a gold medal. The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify).”
I think this was defined on purpose to avoid such contamination. It also seems common sense to me that, when training a system to perform well on IMO 2026, you cannot include any data point from after the questions were made public.
At the same time, training on previous IMO/math contest questions should be fair game. All human contestants practice quite a lot on questions from previous contests, and the IMO is still very challenging for them.
I dunno, I think there are a LOT of old olympiad problems: not just all the old IMOs but also all the old national-level tests from every country that publishes them. (Bottom section here.) I think that even the most studious humans only study a small fraction of existing problems. Like, if someone literally read every olympiad-level problem and solution ever published, then went to a new IMO, I would expect them to find that at least a couple of the problems were sufficiently similar to something they’ve seen that they could get the answer without too much creativity. (That’s just a guess, not really based on anything.)
(That’s not enough for a gold by itself, but could be part of the plan, in conjunction with special-case AIs for particular common types of problems, and self-play-proof-assistant things, etc.)
I know a guy from the Physics Olympiads who was a mobile library of past olympiad problems. I think you’re underestimating the level of weirdness you can find around. Maybe it’s still a fraction of the existing problems, but I’d estimate it’s enough to cover the non-redundant ones.
“I would expect them to find that at least a couple of the problems were sufficiently similar to something they’ve seen that they could get the answer without too much creativity.”
I’ve not been to the IMO, but from comments I’ve overheard from people who have, I’d bet this already happens.
I see about 100 books in there. I’ve met several IMO gold-medal winners and I expect most of them to have read dozens of these books, or the equivalent in other forms. I know one who has read tens of olympiad-level books in geometry alone!
And yes, you’re right that they would often pick one or two problems as similar to what they had seen in the past, but I suspect these problems still require a lot of reasoning even after the analogy has been established. I may be wrong, though.
We can probably inform this debate by getting the latest IMO and creating a contest for people to find which existing problems are the most similar to those in the exam. :)
Eh, there are not that many IMO problems, even including shortlisted problems. Since there are not that many, IMO contestants basically solve all previous IMO problems to practice. So it’s not like AI has an unfair advantage.
I am of the opinion that adding the condition of “not trained on prior IMO/math contest problems” is ridiculous.
IMO problem solving (specifically the problems you need for gold) is much closer to research math than to high school math. Generalizing from some IMO problems to others would be as impressive as starting from scratch.
I kind of disagree. (I was on the South Korean IMO team.) I agree IMO problems are in a category of tasks closer to research math than to high school math, but since IMO problems are intended to be solvable within a time limit, there is an upper limit to their difficulty (quite low, in an absolute sense). Basically, the intended solution is no longer than a single page. Research math problems have no such limit and can be arbitrarily difficult, or have arbitrarily long solutions.
Edit: Apart from the time limit, length limit, and difficulty limit, another important aspect is that IMO problems are already solved, so they are known to be solvable. IMO problems are “Prove X”. Research math problems, even if they are stated as “Prove X”, are really “Prove or disprove X”, and sometimes this matters.
“Prove or disprove X” is only like 2x harder than “Prove X.” Sometimes the gap is larger for humans because of psychological difficulties, but a machine can literally just pursue both in parallel. (That said, research math involves a ton of problems other than prove or disprove.)
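To make the “only like 2x” intuition concrete, here is a minimal sketch of pursuing both directions by interleaving them one step at a time. Everything in it is a toy assumption rather than a real prover: `toy_search`, `hopeless_search`, and `dovetail` are hypothetical names, and a “step” is just one iteration of a generator. The point is only that alternating between the two searches costs at most twice the steps of whichever one succeeds.

```python
from typing import Iterator, Optional, Tuple


def toy_search(steps_needed: int, certificate: str) -> Iterator[Optional[str]]:
    """Hypothetical search that succeeds after `steps_needed` steps."""
    for _ in range(steps_needed - 1):
        yield None  # still searching
    yield certificate


def hopeless_search() -> Iterator[Optional[str]]:
    """Hypothetical search on the wrong side of the question: never succeeds."""
    while True:
        yield None


def dovetail(
    prove: Iterator[Optional[str]], disprove: Iterator[Optional[str]]
) -> Tuple[str, str, int]:
    """Alternate one step of each search until one of them returns a certificate.

    Returns (verdict, certificate, total_steps). Loops forever if neither
    search ever succeeds, which is fine for a toy but not for real use.
    """
    steps = 0
    while True:
        for verdict, search in (("proved", prove), ("disproved", disprove)):
            steps += 1
            result = next(search)
            if result is not None:
                return verdict, result, steps


# X turns out to be false: the disproof takes 7 steps, and the interleaved
# run spends 14 steps in total, i.e. 2x the cost of the successful side.
print(dovetail(hopeless_search(), toy_search(7, "counterexample")))
```

This says nothing about how hard each step is for a real system; it only illustrates why not knowing the answer in advance is, in the worst case, a constant-factor overhead.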
I basically agree that IMO problems are significantly easier than research math or other realistic R&D tasks. However, I think that they are very much harder than the kinds of test questions that machines have solved so far. I’m not sure the difference is about high school math vs research math so much as about very easy problems vs problems designed to be challenging and to require novel thinking.
My view, having spent a fair amount of time on IMO problems as well as on theoretical research and more practical R&D, is that the IMO is significantly easier but just not very far away from the kind of work human scientists need to do in order to be productive.
I think the biggest remaining difference is that the hardest research math problems operate over a timescale about 2-3 orders of magnitude longer than IMO problems, and I would guess transformative R&D requires operating over a timescale somewhere in between. (While IMO problems are themselves about 2-3 orders of magnitude longer for humans than questions that you can solve automatically.)
Research problems also involve a messier set of data and so training on “all IMO problems” is more like getting good at an incredibly narrow form of R&D. And I do think it’s just cognitively harder, but by an amount that feels like much less than a GPT-3 to GPT-4 sized gap.
I’d be personally surprised if you couldn’t close the gap between IMO gold and transformative R&D with 3-4 orders of magnitude of compute (or equivalent algorithmic progress) + an analogous effort to construct relevant data and feedback for particular R&D tasks. If we got an IMO gold in 2023 I would intuitively expect transformative AI to happen well before 2030, and I would shift my view from focusing more on compute to focusing more on data and adapting R&D workflows to benefit from AI.
At least in certain areas of mathematics, research problems are often easier than the harder IMO problems. That is to say, you can get pretty far if you know a lot of previously proven results and combine them in relatively straightforward ways. This seems especially true in areas where it is hard for a single human to know a lot of results, just because it takes a long time to read and learn things.
In the MIRI dialogues from 2021/2022 I thought you said you would update to 40% of AGI by 2040 if AI got an IMO gold medal by 2025? Did I misunderstand, or have you shifted your thinking (and if so, how)?
I agree timescale is a good way to think about this. My intuition is if high school math problems are 1 then IMO math problems are 100 (1e2) and typical research math problems are 10,000 (1e4). So the IMO is exactly halfway on a log scale! I don’t have first-hand experience with the hardest research math problems, but from what I’ve heard about their timescale they seem to reach 1,000,000 (1e6). I’d rate typical practical R&D problems 1e3 and transformative R&D problems 1e5.
Edit: Using this scale, I rate GPT-3 at 1 and GPT-4 at 10. This suggests GPT-5 for IMO, which feels uncomfortable to me! Thinking about this, I think while there is lots of 1-data and 10-data, there is considerably less 100-data, and above that most things are not written down. But maybe that is an excuse and it doesn’t matter.
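The arithmetic behind this scale is easier to see in exponents. Below is a small sketch that stores each rating from the comment above as a power of ten and checks the two claims: that the IMO sits halfway between high school math and typical research math on a log scale, and that GPT-5 lands on the IMO rating. The second check relies on an assumption that is only implied in the comment, namely that each GPT generation adds the same one order of magnitude as GPT-3 to GPT-4 did. The names `SCALE_EXPONENT`, `MODEL_EXPONENT`, `log_midpoint`, and `extrapolated_model_exponent` are made up for illustration.

```python
# log10 of the difficulty ratings given in the comment (high school math = 1 = 1e0).
SCALE_EXPONENT = {
    "high school math": 0,
    "IMO": 2,
    "typical practical R&D": 3,
    "typical research math": 4,
    "transformative R&D": 5,
    "hardest research math": 6,
}

# log10 of the model ratings from the edit: GPT-3 at 1, GPT-4 at 10.
MODEL_EXPONENT = {3: 0, 4: 1}


def log_midpoint(a: str, b: str) -> float:
    """Midpoint of two ratings on a log scale (the geometric mean, as an exponent)."""
    return (SCALE_EXPONENT[a] + SCALE_EXPONENT[b]) / 2


def extrapolated_model_exponent(generation: int) -> int:
    """ASSUMPTION: every generation adds the same step as GPT-3 -> GPT-4."""
    step = MODEL_EXPONENT[4] - MODEL_EXPONENT[3]
    return MODEL_EXPONENT[4] + step * (generation - 4)


# "Exactly halfway": (0 + 4) / 2 equals the IMO exponent of 2.
print(log_midpoint("high school math", "typical research math") == SCALE_EXPONENT["IMO"])

# "GPT-5 for IMO": 1 + 1 equals the IMO exponent of 2.
print(extrapolated_model_exponent(5) == SCALE_EXPONENT["IMO"])
```

Under the same constant-step assumption, transformative R&D (1e5) would correspond to three further generations after GPT-5, which is one way to read why the extrapolation feels uncomfortable.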