Just to quibble a little: I don’t think it’s wise to estimate scaffolding improvements for general capabilities as near zero. By scaffolding, I mean prompt engineering and structuring systems of prompts to create systems in which calls to LLMs are component operations, roughly in line with the original definition. I’ve been surprised that scaffolding didn’t make more of a difference faster, and I would agree that its contribution to general capabilities was near zero for a while. I think this changed when Perplexity partly replicated OAI’s Deep Research accomplishments in both subjective usefulness and performance on Humanity’s Last Exam (~21% vs ~27%), and they did it in around two weeks, indicating little model training. These capabilities are fairly general, and this seems pretty clearly to be a success primarily of scaffolding, suggesting that Deep Research was partly based on o3’s smarts but also heavily helped by the scaffold for iterative search, consolidation, and refinement.
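To make the "iterative search, consolidation, and refinement" pattern concrete, here is a minimal sketch of what such a scaffold loop might look like. Everything here is hypothetical: `llm` and `search` stand in for whatever model API and search backend you plug in, and the prompts are illustrative only, not what any of these products actually use.

```python
def deep_research(question, llm, search, n_rounds=3):
    """Sketch of an iterative search -> consolidate -> refine scaffold.

    llm(prompt) -> str and search(query) -> list[str] are injected,
    so the scaffold itself is model-agnostic. Purely illustrative.
    """
    notes = []
    query = question
    for _ in range(n_rounds):
        # Search, then have the model consolidate the results into notes.
        results = search(query)
        notes.append(llm(
            f"Summarize these results as they bear on: {question}\n"
            + "\n".join(results)))
        # Refine: ask the model what to search for next, given notes so far.
        query = llm(
            f"Given these notes, propose the next search query for: {question}\n"
            + "\n".join(notes))
    # Final consolidation pass over all accumulated notes.
    return llm(
        f"Write a consolidated answer to: {question}\nNotes:\n"
        + "\n".join(notes))
```

The point of the sketch is just that none of this requires model training: the "smarts" live in the base model, while the loop structure supplies the iteration and memory.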
This technically falls into the algorithmic improvement category, but it’s a reason to think the pace of progress increases or even accelerates, since it provides a largely separate route to improvement. Or maybe this is more like a one-time boost—but it could be a large one, and in particularly important directions, like solving entirely novel problems.
DeepMind’s co-scientist also seems like a genuine accomplishment in scaffolding. It used Gemini 2.0 Pro, which just wasn’t that good, and seemed to do something extremely impressive—predict cutting-edge research hypotheses in one shot by reviewing the relevant literature and “thinking” about it deeply. I haven’t looked deeply enough to know whether that story is exaggerated. If it isn’t, similar scaffolds running 2.5 Pro or o3+ might be ready to genuinely accelerate research in ML and elsewhere.
Ege Erdil’s argument for long timelines to Dwarkesh was that algorithmic progress is driven by compute progress, so we should expect it to slow once we run out of compute to just buy. I think this is probably partly true, but the opposite trend—that more effort will go into algorithms if/when compute is a less promising route—also probably holds.
So compute expansion will probably slow down, and it may well produce smaller gains going forward. But there is little reason to expect algorithmic progress to slow, and it may even speed up somewhat as more effort focuses on it and we work out how scaffolding can be useful (some money that can’t buy compute may go toward working on algorithms of various types).
I’m not sure I particularly disagree. My exact claim is just:
my sense is that scaffolding has yielded extremely minimal gains on general purpose autonomous software engineering to date
So, I just think no one has really outperformed baselines for very general domains. In more specific domains people have outperformed baselines, e.g. for writing kernels.
I also think it’s probably possible to do a bunch of scaffolding improvements on general purpose autonomous software engineering (potentially scaffolds that use a bunch more runtime compute), if by no other mechanism than by writing a bunch of more specialized scaffolds that the model sometimes chooses to apply.
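One hypothetical shape for "specialized scaffolds that the model sometimes chooses to apply" is a dispatcher: the model is shown a menu of scaffolds and picks one by name. This is a sketch under assumptions, not anyone's actual implementation; `llm` and the scaffold names are placeholders.

```python
def dispatch(task, llm, scaffolds):
    """Let the model route a task to one of several specialized scaffolds.

    scaffolds maps names to callables and must include a "default" entry,
    used as a fallback if the model's reply isn't a known name.
    Illustrative only.
    """
    menu = ", ".join(scaffolds)
    choice = llm(
        f"Task: {task}\nPick the best tool from: {menu}. "
        "Reply with the name only.").strip()
    handler = scaffolds.get(choice, scaffolds["default"])
    return handler(task)
```

The fallback matters: model-as-router fails ungracefully when the reply doesn't exactly match a name, which is part of why this kind of gluing is fiddly to get working in practice.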
That said, my guess is that it’s pretty likely scaffolding won’t matter that much in practice in the future, at least until AIs are writing a ton of scaffolds autonomously. This is despite it being possible for scaffolding to improve performance: you can get much or all of the benefit of extensive scaffolding by giving AIs a small number of general-purpose tools and doing a bunch of RL, and that looks like the direction people are going in, for better or worse.
Ah yes. I actually missed that you’d scoped that statement to general purpose software engineering. That is indeed one of the most relevant capabilities. I was thinking of general purpose problem-solving, another of the most critical capabilities for AI to become really dangerous.
I agree that even if scaffolding could work, RL on long CoT does something similar, and that’s where the effort and momentum is going.
AIs writing and testing scaffolds semi-autonomously is something I hadn’t considered. There might be a pretty tight loop that could make that effective.
This succinct summary is highly useful, thanks!