I think if you look at “horizon length”—at what task duration (in terms of human completion time) the AIs get the task right 50% of the time—the trends indicate doubling times of maybe 4 months (though 6 months is plausible). Let’s conservatively say 6 months. I think AIs are at something like 30 minutes on math? And 1 hour on software engineering. It’s a bit unclear, but let’s go with that. Then, to get to 64 hours on math, we’d need 7 doublings (64 hours / 30 minutes = 128 = 2^7), which at 6 months each is 3.5 years. So I think the naive trend extrapolation is much faster than you think? (And this estimate strikes me as conservative, at least for math.)
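To spell out that arithmetic, here is a minimal back-of-the-envelope sketch in Python (the 30-minute starting horizon, 6-month doubling time, and 64-hour target are the assumptions from the paragraph above, not measured values):

import math

current_horizon_min = 30        # assumed current 50%-success horizon on math, in minutes
target_horizon_min = 64 * 60    # 64-hour target, in minutes
doubling_time_months = 6        # conservative doubling time from the trend

doublings = math.log2(target_horizon_min / current_horizon_min)  # log2(128) = 7
years = doublings * doubling_time_months / 12                    # 7 * 6 / 12 = 3.5
print(f"{doublings:.0f} doublings, ~{years:.1f} years")

With a 4-month doubling time instead, the same 7 doublings would take about 2.3 years.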
FWIW, this seems like an overestimate to me. Maybe o3 is better than other things, but I definitely can’t get equivalents of 1-hour chunks out of language models, unless it happens to be an extremely boilerplate-heavy step. My guess is more like 15 minutes, and for debugging (which in my experience accounts for close to a majority of software-engineering time), more like 5-10 minutes.
The question of context might be important; see here. I wouldn’t find 15 minutes that surprising for a ~50% success rate, but I’ve seen numbers more like 1.5 hours. I thought that was likely to be an overestimate, so I went down to 1 hour, but more like 15-30 minutes is also plausible.
Keep in mind that I’m talking about agent scaffolds here.
Yeah, I have failed to get any value out of agent scaffolds, and I don’t think I know anyone else who has so far. If anyone has gotten more value out of them than just the Cursor chat, I would love to see how they do it!
Things like Cursor Composer, Codebuff, and other scaffolds have been worse than useless for me (though I haven’t tried them again since o3-mini, which may have made a difference; giving them another try has been on my to-do list).
FYI, I do find aider, using mixed routing between r1 and o3-mini-high as the architect model with sonnet as the editor model, to be slightly better than Cursor/Windsurf etc.
Or, for a minimal setup, this is what is currently ranking highest on aider’s polyglot benchmark:
aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model sonnet
(I don’t expect o3-mini is a much better agent than 3.5 sonnet new out of the box, but a hybrid scaffold with o3 + 3.5 sonnet will probably be substantially better than 3.5 sonnet alone. Just o3 might also be very good. Putting aside cost, I think o1 is usually better than o3-mini on open-ended programming agency tasks.)
I don’t think a doubling every 4 or 6 months is plausible. I don’t think a doubling on any fixed timescale is plausible, because I don’t think overall progress will be exponential. I think you could have exponential progress on thought generation, but this won’t yield exponential progress on performance. That’s what I was trying to get at with this paragraph:
My hot take is that the graphics I opened the post with were basically correct in modeling thought generation. Perhaps you could argue that progress wasn’t quite as fast as the most extreme versions predicted, but LLMs did go from subhuman to superhuman thought generation in a few years, so that’s pretty fast. But intelligence isn’t a singular capability; it’s a phenomenon better modeled as two capabilities, and increasing just one of them happens to have sub-linear returns on overall performance.
So far (as measured by the 7-card puzzle, which I think is a fair data point) I think we went from ‘no sequential reasoning whatsoever’ to ‘attempted sequential reasoning but basically failed’ (Jun 13 update) to now being able to do genuine sequential reasoning for the first time. And if you look at how DeepSeek does it, to me this looks like the kind of thing where I expect difficulty to grow exponentially with argument length. (Based on stuff like it constantly having to go back and double-check even when it got something right.)
What I’d expect from this is not a doubling every N months, but perhaps an ability to reliably do one more step every N months. I think this translates into above-constant returns on the “horizon length” scale—because I think humans need more than 2x time for 2x steps—but not exponential returns.
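As a rough numerical illustration of that claim (the functional form here is an arbitrary assumption of mine, not anything measured): suppose the number of steps a model can chain reliably grows by one per period, and human completion time grows superlinearly in step count, say time ~ steps**1.5. Then the horizon grows faster than linearly but slower than exponentially:

STEP_TIME_EXPONENT = 1.5  # assumed; any exponent > 1 encodes "humans need more than 2x time for 2x steps"

def horizon(steps: int) -> float:
    # Human completion time (arbitrary units) for a task requiring `steps` reliable steps.
    return steps ** STEP_TIME_EXPONENT

prev = None
for steps in range(1, 9):  # one more reliable step per period
    h = horizon(steps)
    ratio = h / prev if prev else None
    print(steps, round(h, 1), round(ratio, 2) if ratio else "-")
    prev = h

Each period adds a larger increment to the horizon than the last (above-constant returns), but the period-over-period ratio shrinks toward 1 rather than staying at a fixed multiple (sub-exponential).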
I expect difficulty to grow exponentially with argument length. (Based on stuff like it constantly having to go back and double-check even when it got something right.)
Training of DeepSeek-R1 doesn’t seem to do anything at all to incentivize shorter reasoning traces, so it’s just rechecking again and again, because why not? It’s like taking an important 3-hour written test and being done after 1 hour: it’s prudent to spend the remaining 2 hours obsessively verifying everything.
Are you aware of the recent METR paper, which measured AI ability to complete long tasks and found that it doubles every 7 months?
Yeah.