I think non-formal IMO gold was unexpected, and we heard explicitly that it won’t be in GPT-5. So I would wait to see how it pans out. It may not matter in 2025, but I think it could in 2026.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? I’d guess the transfer is relatively weak unless the IMO results were driven by general purpose advances. This seems somewhat unlikely: if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
Edit: I think I mostly retract this comment, see below.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? [...] if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
I’m surprised by this. To me it seems hugely important how fast AIs are improving on tasks with poor feedback loops, because obviously they’re in a much better position to improve on easy-to-verify tasks, so “tasks with poor feedback loops” seem pretty likely to be the bottleneck to an intelligence explosion.
So I definitely do think that “better performance on non-trivial-to-verify tasks” is very important for some “plausibly important capabilities”, including agentic software engineering. (Like: this also seems related to why the AIs are much better at benchmarks than at helping people out with their day-to-day work.)
Hmm, yeah I think you’re right, though I also don’t think I articulated what I was trying to say very well.
Like I think my view is:
There was some story where we would see very fast progress on relatively easy-to-verify (or trivial-to-verify) tasks, and that’s what I was talking about. It seems like agentic software engineering could reach very high levels without necessarily needing serious improvements on harder-to-verify tasks.
Faster progress on non-trivial-to-verify tasks might not be the limiting factor if progress on easy-to-verify tasks isn’t that fast.
I still think there won’t be a noticeable jump as the IMO methods make it into production models, but this is due to more general heuristics (and the methods may still matter, it just won’t be something to wait for, I think).
I think the IMO results were driven by general purpose advances, but I agree I can’t conclusively prove it because we don’t know the details. Hopefully we will learn more as time goes by.
An informal argument: I think agentic software engineering is currently blocked on context rot, among other things. I expect the IMO systems to have improved on this, since the IMO time control works out to about 1.5 hours per problem.
(I’m skeptical that much of the IMO improvement was due to improving how well AIs can use their context in general. This isn’t a crux for my view, but it also seems pretty likely that the AIs didn’t do more than ~100k serial tokens of reasoning for the IMO while still aggregating over many such reasoning traces.)