Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? [...] if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
I’m surprised by this. To me it seems hugely important how fast AIs are improving on tasks with poor feedback loops, because obviously they’re in a much better position to improve on easy-to-verify tasks, so “tasks with poor feedback loops” seem pretty likely to be the bottleneck to an intelligence explosion.
So I definitely do think that “better performance on non-trivial-to-verify tasks” is very important for some “plausibly important capabilities”, including agentic software engineering. (Like: this also seems related to why the AIs are much better at benchmarks than at helping people out with their day-to-day work.)
Hmm, yeah I think you’re right, though I also don’t think I articulated what I was trying to say very well.
Like I think my view is:
There was some story where we would see very fast progress on relatively easy-to-verify (or trivial-to-verify) tasks, and that’s what I’m talking about. It seems like agentic software engineering could reach very high levels without necessarily needing serious improvements on harder-to-verify tasks.
Faster progress on non-trivial-to-verify tasks might not be the limiting factor if progress on easy-to-verify tasks isn’t that fast.
I still think that there won’t be a noticeable jump as the IMO methods make it into production models, but this is due to more general heuristics (and the methods may still matter; it just won’t be something to wait for, I think).