Since this is mid-late 2025, we seem to be behind the aggressive AI 2027 schedule? The claims here are pretty weak, but if LLMs really don’t boost coding speed, this description still seems to be wrong.
[edit: okay actually it’s pretty much mid 2025 still, months don’t count from zero though probably they should because they’re mod 12]
I don’t think there’s enough evidence to draw hard conclusions about this section’s accuracy in either direction, but I would err on the side of thinking AI 2027’s description is correct.
Footnote 10, visible in your screenshot, reads:

For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
SOTA models score at:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn’t allow it for its own resolution, but here I think it’s okay, given that the footnote above makes claims about ‘coding agents’, which presumably allow iteration at test time.)
Also, note the following paragraph immediately after your screenshot:

The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month. Still, many companies find ways to fit AI agents into their workflows.
AI twitter sure is full of both impressive cherry-picked examples and stories about bungled tasks. I also agree that the claim about “find[ing] ways to fit AI agents into their workflows” is exceedingly weak. But it’s certainly happening: a quick Google for “AI agent integration” turns up this article from IBM, where agents are diffusing across multiple levels of the company.
If I understand correctly, Claude’s pass@X benchmarks mean sampling multiple times and taking the best result. This is valid so long as the compute cost doesn’t exceed the equivalent cost of an engineer.
codex-1’s pass@8 score seems to be saying “the correct solution was present among the 8 attempts, but the model doesn’t actually know which one is correct”. That shouldn’t count.
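For reference, a minimal sketch of how pass@k is typically computed on benchmarks like SWE-bench Verified; this is the standard unbiased estimator from the HumanEval/Codex paper, and the numbers below are illustrative rather than taken from any model card. The benchmark’s hidden tests do the grading, so pass@k for k > 1 really does mean “a correct solution existed among the samples”, not that the model identified it:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n samples per problem of which c passed the hidden tests."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 8 samples on a task, 2 of which pass.
print(pass_at_k(8, 2, 1))  # 0.25 -- chance a single sampled patch passes
print(pass_at_k(8, 2, 8))  # 1.0  -- some patch among all 8 passed
```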
Why do I see no higher than about 75% here?
https://www.swebench.com
Yeah, I wanted to include that paragraph but it didn’t fit in the screenshot. It does seem slightly redeeming for the description. Certainly the authors hedged pretty heavily.
Still, I think that people are not saving days by chatting with AI agents on Slack. So there’s a vibe here which seems wrong: the vibe is that these agents are unreliable but are offering very significant benefits. That is called into question by the METR report showing they slowed developers down. There are problems with that report, and I would love to see some follow-up work to be more certain.
I appreciate your research on the SOTA SWEBench-Verified scores! That’s a concrete prediction we can evaluate (less important than real world performance, but at least more objective). Since we’re now in mid-late 2025 (not mid 2025), it appears that models are slightly behind their projections even for pass@k, but certainly they were in the right ballpark!
Sorry, this is the most annoying kind of nitpicking on my part, but since I guess it’s probably relevant here (and for your other comment responding to Stanislav down below), the center point of the year is July 2, 2025. So we’re just two weeks past the absolute mid-point – that’s 54.4% of the way through the year.
Also, the codex-1 benchmarks were released on May 16, while Claude 4’s were announced on May 22 (both certainly before the midpoint).
The prediction is correct on all counts, and perhaps slightly understates progress (though it obviously makes weak/ambiguous claims across the board).
The claim that “coding and research agents are beginning to transform their professions” is straightforwardly true (e.g. 50% of Google lines of code are now generated by AI). The METR study was concentrated in March (which is early 2025).
And it is not currently “mid-late 2025”, it is 16 days after the exact midpoint of the year.
Where is that 50% number from? Perhaps you are referring to this post from Google Research. If so, you seem to have taken it seriously out of context. Here is the text before the chart that shows 50% completion:

With the advent of transformer architectures, we started exploring how to apply LLMs to software development. LLM-based inline code completion is the most popular application of AI applied to software development: it is a natural application of LLM technology to use the code itself as training data. The UX feels natural to developers since word-level autocomplete has been a core feature of IDEs for many years. Also, it’s possible to use a rough measure of impact, e.g., the percentage of new characters written by AI. For these reasons and more, it made sense for this application of LLMs to be the first to deploy.

Our earlier blog describes the ways in which we improve user experience with code completion and how we measure impact. Since then, we have seen continued fast growth similar to other enterprise contexts, with an acceptance rate by software engineers of 37% assisting in the completion of 50% of code characters. In other words, the same amount of characters in the code are now completed with AI-based assistance as are manually typed by developers. While developers still need to spend time reviewing suggestions, they have more time to focus on code design.
This is referring to inline code completion, so it’s more like advanced autocomplete than an AI coding agent. It’s hard to interpret this number, but it seems very unlikely that it means half the coding is being done by AI; much more likely, it is often easy to predict how a line of code will end given the first half of that line and the preceding context. Probably 15-20% of what I type into a standard Linux terminal is autocompleted without AI?
Also, the right metric is how much AI assistance is speeding up coding. I know of only one study on this, from METR, which showed that it is slowing down coding.
Progress-wise this seems accurate, but the usefulness gap is probably larger than the one this paints.
Two days later, is this still a fail? ChatGPT agent is supposed to do exactly that. There seems to be a research model within OpenAI that is capable of getting gold on the IMO without any tools.
Maybe it does not meet expectations yet. Maybe it will with the GPT-5 release. We do not know whether the new unreleased model is capable of helping with research. However, it’s worth considering the possibility that this is a slightly slower timeline rather than a complete miss.
I wonder to what extent leadership at OpenAI see AI 2027 as a set of milestones that they need to meet to really be as powerful/scary as they’re said to be.
E.g., would investors/lenders be more hesitant if OpenAI seems to be ‘lagging behind’ the AI 2027 predictions?
Yeah, I wouldn’t be surprised if these timelines are at least somewhat hyperstitious.
Yeah, well, let’s wait and see what GPT-5 looks like.
But it isn’t August or September yet. Maybe someone will end up actually creating capable agents. In addition, the amount of operations used for creating Grok 4 was estimated as 4e27--6e27, which seems to align with the forecast. The research boost rate by Grok 4 or a potentially tweaked model wasn’t estimated. Maybe Grok 4 or an AI released in August will boost research speed?
It was indicated in the opening slide of the Grok 4 release livestream that Grok 4 was pretrained with the same amount of compute as Grok 3, which in turn was pretrained on 100K H100s, so probably 3e26 FLOPs (40% utilization for 3 months at 1e15 FLOP/s per chip). RLVR has a 3x-4x lower compute utilization than pretraining, so if we insist on counting RLVR in FLOPs, then 3 months of RLVR might be 9e25 FLOPs, for a total of 4e26 FLOPs.
Stargate Abilene will be 400K chips in GB200 NVL72 racks in 2026, which is 10x more FLOP/s than 100K H100s. So it’ll be able to train 4e27-8e27 FLOPs models (pretraining and RLVR, in 3+3 months), and it might be early 2027 when they are fully trained. (Google is likely to remain inscrutable in their training compute usage, though Meta might also catch up by then.)
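For concreteness, a back-of-envelope sketch of the arithmetic in the two paragraphs above; the per-chip throughput, utilization figures, and 3-month windows are the assumptions stated there (the ~2.6e6 seconds per month is my own rounding), not measured values:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6e6 s

def training_flops(chips, flops_per_chip, utilization, months):
    """Total FLOPs = chips x peak FLOP/s x utilization x wall-clock seconds."""
    return chips * flops_per_chip * utilization * months * SECONDS_PER_MONTH

# Grok 3 / Grok 4 pretraining: 100K H100s, ~1e15 FLOP/s each, 40% utilization, 3 months.
pretrain = training_flops(100_000, 1e15, 0.40, 3)      # ~3e26 FLOPs
# RLVR at ~3x-4x lower utilization for another 3 months.
rlvr = training_flops(100_000, 1e15, 0.40 / 3.5, 3)    # ~9e25 FLOPs
print(f"Grok 4 total:  ~{pretrain + rlvr:.0e} FLOPs")           # ~4e26

# Stargate Abilene (2026): ~10x the FLOP/s of 100K H100s, same 3+3 month schedule.
print(f"Abilene model: ~{10 * (pretrain + rlvr):.0e} FLOPs")    # ~4e27, the low end of 4e27-8e27
```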
the amount of operations used for creating Grok 4 was estimated as 4e27--6e27

(I do realize it’s probably some sort of typo, either yours or in your unnamed source. But 10x is almost 2 years of even the current fast funding-fueled scaling; that’s not a small difference.)
We’ve been going back and forth on this a bit. It seems like your model suggests AGI in 2027 is pretty unlikely?
That is, we see the first generation of massively scaled RLVR around 2026/2027. So it kind of has to work out of the box for AGI to arrive that quickly?
I suppose this is just speculation though. Maybe it’s useful enough that the next generation is somehow much, much faster to arrive?
That is, we see the first generation of massively scaled RLVR around 2026/2027. So it kind of has to work out of the box for AGI to arrive that quickly?

By 2027, we’ll also have 10x scaled-up pretraining compared to current models (which were trained on 2024 compute), and correspondingly scaled RLVR, with many diverse tool-using environments that are not just about math and coding-contest-style problems. If we go 10x lower than current pretraining, we get the original GPT-4 from March 2023, which is significantly worse than current models. So with 10x higher pretraining than current models, the models of 2027 might make significantly better use of RLVR training than the current models can.
Also, 2 years might be enough time to get some sort of test-time training capability started, either with novel or currently-secret methods, or by RLVRing models to autonomously do post-training on variants of themselves to make them better at particular sources of tasks during narrow deployment. Apparently Sutskever’s SSI is rumored to be working on the problem (at 39:25 in the podcast), and overall this seems like the most glaring currently-absent faculty. (Once it’s implemented, something else might end up a similarly obvious missing piece.)
it seems like your model suggests AGI in 2027 is pretty unlikely?

I’d give it 10% (for 2025-2027). From my impression of current capabilities and the effect of scaling so far, the remaining 2 OOMs of compute seem like a 30% probability of getting there (by about 2030), with a third of that in the first 10x of the remaining scaling, which is 10% with 2026 compute (for 2027 models). After 2029, scaling slows down to a crawl (relatively speaking), so maybe another 50% for the 1000x of scaling in 2030-2045, when there’ll also be time for any useful schlep, with 20% remaining for 2045+ (some of it from a coordinated AI Pause, which I think is likely to last if at all credibly established). If the 5 GW AI training systems don’t get built in 2028-2029, they are still likely to get built a bit later, so this essentially doesn’t influence predictions outside the 2029-2033 window; some probability within it merely gets pushed a bit toward the future.
So this gives a median of about 2034. If AGI is still not working in the early 2030s even with more time for schlep, the probability at that level of compute starts going down, so the 2030s are front-loaded in probability even though compute is not scaling faster in the early 2030s than later.
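To make the shape of that distribution concrete, here is a small sketch of the cumulative probabilities the parent comment implies. The uniform-within-window interpolation is my own illustrative assumption; the comment itself front-loads the 2030s, which is what pulls the median back from ~2036 toward the stated ~2034:

```python
import numpy as np

# Cumulative P(AGI by year) from the stated window totals:
# 10% by end of 2027, 30% by ~2030, a further 50% over 2030-2045, 20% for 2045+.
years = np.array([2024.0, 2027.0, 2030.0, 2045.0])
cdf   = np.array([0.00,   0.10,   0.30,   0.80])

# Median under a flat spread within each window:
median_uniform = np.interp(0.50, cdf, years)
print(round(float(median_uniform), 1))  # 2036.0 -- front-loading the 2030s
                                        # moves this earlier, toward ~2034
```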
The section I shared is about mid 2025. I think August-September is late 2025.
Early: January, February, March, April
Mid: May, June, July, August
Late: September, October, November, December
Okay, yes, but this thread of discussion has gone on long enough now, I think; we basically agree to within a month.