I have no idea of the maths, but reading through the Epoch article it seems to me that this result is entirely unexpected.
“but this year I’d only give a 5% chance to either a qualitatively creative solution or a solution to P3 or P6 from an LLM.”
Sure, it's an unreleased LLM, but it still seems to be an LLM.
Worth noting this year's P3 was really easy; Gemini 2.5 Pro even got it some of the time, and Grok 4 Heavy and Gemini Deep Think have solved problems rated as harder. Still an achievement, though.
From the author of the Epoch article:
https://x.com/GregHBurnham/status/1946655635400950211
https://x.com/GregHBurnham/status/1946725960557949227
https://x.com/GregHBurnham/status/1946567312850530522
This is important context not only for evaluating Greg Burnham's accuracy but also for the Gold Medal headline. If this difficulty chart is accurate (still no idea on the maths), getting 5⁄6 is not much of a surprise. Even questions 2 and 5 seem abnormally easy relative to previous years.
To put this into perspective, there was only an 8% chance P3 would be this easy, which puts substantial weight on the "unexpected" part being that the problem turned out to be so easy. It's also the first time in 20 years (a 5% chance) that five problems were of difficulty ≤ 25.
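(For context on where a figure like that 5% comes from: with roughly 20 years of MOHS-style difficulty ratings, it's just the empirical frequency of years where at least five problems rate ≤ 25. A minimal sketch below; the ratings are made-up placeholders, not Epoch's actual data or methodology.)

```python
# Sketch of reading a base rate like "5% chance" off historical difficulty
# ratings. The ratings here are MADE-UP placeholders standing in for ~20
# years of MOHS-style scores (one list of six per IMO year); this is NOT
# Epoch's data or methodology.
ratings_by_year = {
    2006: [5, 15, 40, 10, 25, 50],  # hypothetical values
    2007: [5, 25, 45, 10, 30, 40],
    # ... remaining years elided ...
    2025: [5, 10, 20, 25, 20, 40],
}

def year_is_easy(ratings, threshold=25, count=5):
    """True if at least `count` problems rate at or below `threshold`."""
    return sum(r <= threshold for r in ratings) >= count

easy_years = [y for y, r in ratings_by_year.items() if year_is_easy(r)]
base_rate = len(easy_years) / len(ratings_by_year)
print(f"{len(easy_years)}/{len(ratings_by_year)} years -> {base_rate:.0%}")
# With a full 20-year dataset containing exactly one qualifying year,
# this would print 1/20 -> 5%, the figure quoted above.
```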
Indeed, knowing that Gemini 2.5 Deep Think could solve an N25 (the IMO result from Gemini 2.5 Pro) and an A30 (known from the Gemini 2.5 Deep Think post), I'm somewhat less impressed. The only barriers were a medium-ish geometry problem (P2), which of course AlphaGeometry could solve, and an easy combinatorics problem (P1).
Factoring in this write-up by Ralph Furman, the two most impressive things are:
* OpenAI's LLM was able to solve a medium-level geometry problem (I'm guessing DeepMind just used AlphaGeometry again). Furman thought this would be hard for informal methods.
* OpenAI's LLM is strong enough to get the easy combinatorics problem (Furman noted informal methods would likely outperform formal ones on this one; it was just a matter of whether the LLM was smart enough).