I think the IMO results were driven by general-purpose advances, but I agree I can’t conclusively prove it because we don’t know the details. Hopefully we will learn more as time goes by.
An informal argument: I think agentic software engineering is currently blocked on context rot, among other things. I expect the IMO systems to have improved on this, since the IMO time control is 1.5 hours per problem.
(I’m skeptical that much of the IMO improvement was due to improving how well AIs can use their context in general. This isn’t a crux for my view, but it also seems pretty likely that the AIs didn’t do more than ~100k serial tokens of reasoning on the IMO, while still aggregating over many such reasoning traces.)