Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
I’ve recently written about how I’ve updated against seeing substantially faster than trend AI progress due to quickly massively scaling up RL on agentic software engineering. One response I’ve heard is something like:
RL scale-ups so far have used very crappy environments due to difficulty quickly sourcing enough decent (or even high quality) environments. Thus, once AI companies manage to get their hands on actually good RL environments (which could happen pretty quickly), performance will increase a bunch.
Another way to put this response is that AI companies haven’t actually done a good job scaling up RL—they’ve scaled up the compute, but with low quality data—and once they actually do the RL scale up for real this time, there will be a big jump in AI capabilities (which yields substantially above trend progress). I’m skeptical of this argument because I think that ongoing improvements to RL environments are already priced into the existing trend: I expect that a substantial part of the progress we saw in late 2024 and so far in 2025 was driven by AI companies acquiring better RL environments. To be clear, I do think that higher quality RL environments will be a substantial part of what drives progress over the next year or two (and perhaps beyond that), I just don’t think this will be such a large improvement as to cause above trend progress.
More generally, we should expect reasonably predictable long-running trends to emerge as the result of a number of advances, including some relatively larger advances. The straight lines of advancement in AI are driven by numerous hard-won advances, many of which seem like a huge deal (one that would cause above trend progress) from the inside. But when added all together, these advances just result in the overall trend: the underlying generator of these advances was priced in, and without these advances progress would have slowed. (Correspondingly, if there weren’t ideas for how to make AIs substantially more capable on the horizon, this would be a bear signal: the pace of AI progress we’ve seen requires that there are always further promising capability improvements to pursue. This is possible because effort scales up exponentially over time.)
This isn’t to say that we should be confident that AI progress trends will continue smoothly (even if inputs continue to scale at the same rate): advances could just randomly peter out and there might be sufficiently massive breakthroughs that break the trend.
(In AI, I think we’ve seen perhaps 2 massively trend-breaking breakthroughs in the last 20 or so years: deep learning at substantial scale (starting with AlexNet) and (maybe?) scaling up generative pretraining (starting with GPT-1).[1] Scaling up RL and reasoning models probably caused somewhat above trend progress (in 2025), but I don’t think this constitutes a massive trend break.)
Additionally, it could be that the seemingly exponential trend we’re looking at is inherently sigmoidal (capping out somewhat soon) or is substantially superexponential (with this superexponentiality making a big difference somewhat soon).
I should also note that while on-trend AI progress doesn’t suggest very short AGI timelines[2], on-trend progress is fast! The naive extrapolation implies that we’ll see AIs which can complete 50% of (easily verified and self-contained) tasks that take human professionals a month in 3 years! (I’d guess this suffices for large productivity increases for many types of software engineering and substantial productivity increases in AI R&D.[3])
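To make that extrapolation concrete, here is a minimal sketch of the arithmetic behind this kind of estimate. The current horizon and doubling time below are illustrative assumptions loosely based on METR’s published trend, not figures from this post:

```python
import math

# Illustrative assumptions (not from the post): a ~2 hour 50%-success time
# horizon for frontier models in mid-2025, doubling roughly every 5 months
# (between METR's longer-run ~7-month rate and the faster 2024-2025 rate).
current_horizon_hours = 2.0
doubling_time_months = 5.0

# "Tasks that take human professionals a month" ~ 160 working hours.
target_horizon_hours = 160.0

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
years_needed = doublings_needed * doubling_time_months / 12

print(f"doublings needed: {doublings_needed:.1f}")         # ~6.3
print(f"years at this doubling rate: {years_needed:.1f}")  # ~2.6
```

Small changes to the assumed doubling time shift the answer by a year or more, which is one reason the noisiness of the METR trend (discussed in the comments below) matters.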
Overall, I mostly dismiss the specific argument that we’ll see above trend progress due to improved RL environments and more broadly my default expectation is that progress will continue at a reasonably steady rate despite being driven by new advances that feel salient to AI company insiders.[4] We could see a breakthrough in making higher quality RL environments that causes rapid above trend progress, but we probably won’t.
I’ll now address some possible counterarguments.
Counterargument: Actually, companies haven’t gotten around to improving RL environment quality until recently (or there is substantial lead time on scaling up RL environments etc.) so better RL environments didn’t drive much of late 2024 and 2025 progress
Even if the fraction of effort going into RL environments has increased greatly (e.g. because AI companies have updated to think this is more important), I expect that this effort will come from some other endeavor which wasn’t that much worse than scaling up RL environments, so the trend will probably persist. The trend is already driven by companies updating their priorities and rebalancing between different areas, so this is already priced in.
It could turn out to be the case that companies haven’t really gotten much of the value from acquiring non-crappy RL environments yet and this happens to be a much better use of resources than other endeavors, in which case we’d see above trend progress. But, this is a pretty specific hypothesis that doesn’t seem well supported by publicly available evidence, so by default, I’m skeptical.
Counterargument: AIs will soon reach a critical capability threshold where AIs themselves can build high quality RL environments
Before AIs can (mostly) autonomously build high quality RL environments for themselves, they’ll be able to substantially assist humans in building mediocre RL environments for themselves. I already expect some (probably substantial) effect from AIs helping to build RL environments. I don’t see a particular reason to expect some critical threshold here, and I especially don’t see a reason to expect such a threshold to arrive in the next year. It currently seems like AIs would struggle to mostly autonomously build high quality RL environments for themselves, and I don’t think the trends strongly suggest this will be different very soon.[5]
I do think that this sort of “AIs generate their own training environments” flywheel could cause superexponential progress via the same sort of mechanism as AIs automating AI R&D, though I don’t expect to see this data generation effect show up much in overall AI progress. And, even if it did show up, I’d currently expect this to ramp up more gradually (we haven’t seen much evidence of this causing faster progress in the last year, but it’s hard to tell).
This isn’t to say that critical thresholds aren’t possible or that there couldn’t be sudden large breakthroughs in having AIs generate training data for themselves, I just don’t see a particular reason to think this is likely in the short term.
Counterargument: AI companies are massively fucking up their training runs (either pretraining or RL) and once they get their shit together more, we’ll see fast progress
AI companies are constantly massively scaling up compute and switching to new clusters. And they are sometimes scaling up new training strategies (like RL). Companies only get one or a few attempts at maximum scale training runs (often at a new larger scale or on a new cluster), so it’s not that surprising that they often mess up these runs or face difficulties getting the new cluster to work. And, by the time companies figure out how to get training at a given scale to work well, they are already trying to scale up further. Companies are also generally highly chaotic with a ton of things going on, rapid growth in the number of employees, and rushed model release cycles which all cause more errors.
Thus, my view is that AI companies will probably always be messing up their maximum training runs to some extent and losing some efficiency due to this. It could be that AI companies have particularly been messing up their training runs over the last year or so and this is somewhat transient (yielding some boost as this gets resolved). But, it’s hard for me to imagine this being a huge effect because even a 4x effective compute multiplier is only around half of a year of progress and it’s hard to imagine getting more than a 4x multiplier from not messing up bigger pretraining or RL runs (because you could e.g. just operate at 4x smaller scale until you can work out problems).
The fact that AI companies are often messing up their training runs does mean that even after compute scaling stops there will still be some period of low-hanging fruit from not messing up training runs (and running longer training runs).
Counterargument: This isn’t that related to RL scale up, but OpenAI has some massive internal advance in verification which they demonstrated via getting IMO gold and this will cause (much) faster progress late this year or early next year
Naively, it’s not obvious to me that an advance in verification would make a big difference in performance on easily verified agentic software engineering tasks, but I do see how such an advance could help close the gap between easily verified software engineering and less easily verified tasks (or parts of tasks that are less easily verified, e.g. writing clean and easy to understand code). Beyond this, I think many advances that allow for better verification of natural language proofs wouldn’t really transfer to better verification in the context of agentic software engineering (which is the main capability we should be tracking according to my views).
Beyond this, it’s hard to put much confidence in rumors from OpenAI because OpenAI would have strong incentives to hype this result internally. Overall, I’d guess this advance is real, but probably isn’t that big of a deal outside of math (e.g., it drives less than 1⁄4 of a year’s worth of AI progress in agentic software engineering; though note that 1⁄4 of a year’s worth of progress feels huge from the inside; AI progress is fast). It’s also worth noting that GDM was able to do similarly well at the IMO despite likely not having this exact advance, so probably the field was generally close to IMO gold (on this IMO) without needing massive advances. We’ll have to wait and see whether this is actually a big deal.
Thoughts and speculation on scaling up the quality of RL environments
While I’ve argued against various counterarguments, it’s worth emphasizing that I do overall think that RL environment quality will be a big deal (though I’d guess this probably won’t be the most important factor driving AI progress). And I think it will be possible for AI companies to get much better RL environments given some time. One way to think about this is to analyze how much AI companies might be willing to spend on RL environments.
In 2025, relevant AI companies are spending up to tens of billions of dollars (OpenAI will maybe spend around $30 billion while Anthropic might be at more like $6 billion), so if they were willing to spend a twentieth of this money to acquire RL environments (which seems non-crazy to me), then this could easily be a billion dollars. $1 billion is actually a lot of money to spend on RL environments: you could buy 10 million environments for $100 each (e.g., 1 hour of a decent software engineer per environment), 1 million environments for $1,000 each, or 100,000 environments for $10,000 each. Environments would probably often be made in a batched or parametric way where some labor results in making many somewhat similar environments all at once, so this level of spending could justify a lot of labor on each batch of such environments. It’s unclear what the exact returns to quality are and how many RL environments companies want, but regardless companies could afford to spend a lot on each environment.
Another way to think about this is that it could be reasonable to spend within the same order of magnitude on each RL environment as you spend in compute cost to train on that environment. I think the compute cost for doing RL on a hard agentic software engineering task might be around $10 to $1000 ($0.1 to $1 for each long rollout and you might do 100 to 1k rollouts?), so this justifies a lot of spending per environment. And, environments can be reused across multiple training runs (though they could eventually grow obsolete).
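As a rough consolidation of the two back-of-the-envelope calculations above, here is a minimal sketch; all numbers are the post’s illustrative figures, not company data:

```python
# Budget-side BOTEC: environments a ~$1B budget buys at various per-env costs.
env_budget = 1e9  # roughly 1/20th of the larger companies' annual spend
for cost_per_env in (100, 1_000, 10_000):
    n_envs = env_budget / cost_per_env
    print(f"${cost_per_env:>6,} per environment -> {n_envs:,.0f} environments")

# Compute-side BOTEC: RL training compute spent per hard agentic SWE environment,
# which arguably bounds what it's sensible to spend creating that environment.
cost_per_rollout_usd = (0.1, 1.0)   # post's rough range per long rollout
rollouts_per_env = (100, 1_000)
low = cost_per_rollout_usd[0] * rollouts_per_env[0]
high = cost_per_rollout_usd[1] * rollouts_per_env[1]
print(f"training compute per environment: ${low:.0f} to ${high:.0f}")
```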
I’m not that sure about the exact numbers here, but regardless a lot of spending on RL environments is likely justified. This will probably take some time to scale up if the main strategy involves using tons of human labor (rather than e.g. mostly generating these environments using AIs and purchased data, though this also could take a while to figure out). Thus, we should expect to see some scale up of RL environment generation that drives (some fraction of) AI progress for a while, though I don’t see a reason for this to kick in super quickly as I argued above.
Over time, there will be algorithmic progress on making better RL environments (for some level of spend) as processes and understanding of what makes an RL environment good improves. This effect might drive most of the improvements in RL environment quality rather than this being driven by just scaled up spending (though scaled up spending will also assist with research into how to make better RL environments in general). This type of algorithmic progress (on data creation) presumably works somewhat similarly to other types of algorithmic progress in that it’s a multiplier on resources (in this case, both a multiplier on spending on generating RL environments and a multiplier on training compute). However, I expect that algorithmic progress on data generation is less likely to transfer well to future models multiple years from now than other types of algorithmic progress.
[1] This naively suggests a rate of massively trend-breaking breakthroughs of around 1 every 10 years. This is part of why I don’t put that low of a probability on full AI R&D automation in just a few years: there is a reasonable chance of a massive breakthrough which substantially boosts the probability. Note that a massive breakthrough wouldn’t necessarily suffice, but this does drive up the probability a bunch. Also, I don’t feel that confident about the rate, and I’d be sympathetic to interpreting the evidence as suggesting a rate of more like 1 every 20 years or 1 every 7 years.
[2] Given basically on-trend AI progress, less than 3 years until full automation of AI R&D seems pretty unlikely.
[3] I expect easily verified/benchmarked capabilities will surpass real world usefulness, but I expect real world usefulness will still be pretty high at the point when AIs are doing this well on easily verified tasks.
[4] That said, each time there is a salient advance which seems like it could be a very big deal (once scaled up / figured out / etc.), we should put some probability on this driving above trend progress, like I did at the start of 2025. In retrospect, I think I was still too bullish ex-ante.
[5] It’s unclear what level of capability (in terms of e.g. METR’s notion of horizon-length) corresponds to being able to generate high quality RL environments mostly autonomously (as in, humans provide suggestions and AIs execute). I’d guess this probably scales somewhat with the difficulty of the RL environments and has some minimum level of capability. I’d guess making a high quality new class of RL environments takes human engineers around a few days (???), so we’d maybe expect that an 80%-reliability horizon length of at least 24 hours is required, which we’d expect in maybe like 3 years. Though, we’d presumably expect substantial gains from lots of AI automation of RL environment construction before this.
I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don’t think you should strongly expect that particular line to continue to be straight over a long time period.
(Though I tend to think that it’s more likely to slow down by 2027 than to speed up, relative to what you’d predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)
[1] I’m excluding GPT-2 and GPT-3 which seem very noisy.
[2] Compared to things like Moore’s Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
[3] None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I’m aware of.
Anecdotally, GPT-5 seems way above trend for real-world agentic coding.
I think METR’s constrained task-set overstated the capabilities of previous models, and the “real-world” performance of GPT-5 in e.g. Codex CLI seems to be much much higher than e.g. Claude Code.
Sonnet/Opus 4 was almost never able to test, debug, and fix end-to-end in our real codebase, and GPT-5 usually can.
I don’t work with RL, but I predict that if you created an RL environment with
1. GPT-5 (released <1 month ago) via Codex CLI
2. Claude Opus 4 (released 4 months ago) via Claude Code
you would see a dramatic difference in robustness/quality/functionality. Perhaps someone could test this.
I’m not sure what drove GPT-5 seeming so much better for agentic coding (plausibly a big RL scale-up + other improvements?), but I do expect recent and upcoming advancements to drive an explosion of RL environment quality/quantity/robustness.
I think this might be a case where, for each codebase, there is a particular model that goes from “not reliable enough to be useful” to “reliable enough to sometimes be useful”. At my workplace, this first happened with Sonnet 3.6 (then called Claude Sonnet 3.5 New). There was what felt like a step change from 3.5 to 3.6: earlier improvements felt less impactful because the models couldn’t reliably handle the boilerplate, and later improvements felt less impactful because once a model can write the boilerplate, there isn’t really a lot of alpha in doing it better, and none of the models are reliable enough that we trust them to write bits of core business logic where bugs or poor choices can cause subtle data integrity issues years down the line.
I suspect the same is true of e.g. trying to use LLMs to do major version upgrades of frameworks—a team may have a looming django 4 → django 5 migration, and try out every new model on that task. Once one of them is good enough, the upgrade will be done, and then further tasks will mostly be easier ones like minor version updates. So the most impressive task they’ve seen a model do will be that major version upgrade, and it will take some time for more difficult tasks that are still well-scoped, hard to do, and easy to verify to come up.
Hmm, I’ve heard many conflicting anecdotes here. My own experience is that GPT-5 is extremely bad at agentic coding compared with e.g. Opus 4.1 and even Sonnet 4. And that’s not taking time into account: it uses like 10-100x the time Sonnet does, which makes it mostly worthless to me.
For me it’s only been good at 1-turn stuff, similar to o3-pro (or, my experience with o3-pro). Like I’ll tell it to fix a bug, with detailed info and context, and then run it for a while, and it’s pretty good at fixing bugs that way. But if it doesn’t work, I’ll just revert all its changes and fix the bug myself. If there is multi-step stuff, like fixing a bug and then writing tests, or implementing module x and hooking it up to interface y, it just isn’t very good.
What app were you using? This sounds very similar to my experience using GPT-5 in Cursor.
Codex CLI is much much better—night and day difference.
I suppose this is good evidence that harness-specific RL was important for GPT-5.
This comports with my experience. GPT-5 is better at 1-shot builds, like “get a prototype of a web app that does X.” But it seems to have a harder time than Claude avoiding breaking stuff when my requests target an existing large code base, which is the majority of my work. For example, if I say “look through Y documentation, develop a plan for X change, and execute it”, Opus 4.1 tends to do this more reliably.
I think an interesting experiment would be to test different levels of specificity in prompts, across different sorts of codebases. My experience tells me that Claude is better at taking higher level, less specific requests, developing an actionable plan taking the codebase into account, then executing that plan. At least around data engineering type codebases that I’m familiar with.
But this might not be so with, say, web development. Or maybe even data engineering in different contexts. The models might be spiky in subtle ways, where specificity matters more in certain contexts more than others.
What apps have you tried for this, and how recently?
Most of my usage is multi-turn in a 200k line codebase, for what it’s worth. It’s extremely rare that GPT-5 (via Codex CLI) breaks anything.
Non-OpenAI pre-RLVR chatbots might serve as an anchor for how long it takes an AI company to turn an algorithmic idea into a frontier model, after it becomes a clearly worthwhile thing to do. Arguably only Anthropic managed to catch up to OpenAI, and it took them 1.5 years with Sonnet 3.5. Even Google never caught up after 2+ years, their first credibly frontier chatbot is Gemini 2.5 Pro, which is already well into RLVR (and similarly for Grok 4). So it seems reasonable to expect that it would take about 2 years for RLVR-based models to start being done well, somewhere in 2026-2027.
The IMO results probably indicate something about the current lower bound on capabilities in principle, for informally graded tasks such as natural language proofs. This is a lot higher than what finds practical use so far, and improvements in 2026-2027 might be able to capture this kind of thing (without needing the scale of 2026 compute).
I think scraping and filtering MCP servers and then RL training to navigate them is largely (even if not fully) automatable and already being done (cf. this for SFT), but doesn’t unlock massive value.
I’d be curious to get a better sense of what sorts of RL environments you’re imagining. Math problems? Video game environments? Complicated multi-agent conversations? The near-term feasibility of AI generating novel RL environments seems like it varies dramatically depending on the answers.
There’s an efficient-market-hypothesis-style argument that everything’s priced in irrespective of the details, but I’m skeptical of that sort of argument in a context where the relevant players are bottlenecked on people and ability to test ideas.
I’m curious about this partly because DeepMind’s recently released Genie 3 (impressive flashy video, blog post) surprised me with how good it is, and seems like it plausibly hits the threshold at which high-quality video game RL environments can be generated at scale much more cheaply than an hour of developer time[1] (potentially triggering the kind of superexponential increase you talk about).
Caveat: I’m not sure how expensive it is in compute; that could potentially offset the decreased cost in developer time.
Agentic software engineering mostly; I don’t think Genie matters.
Somewhat implied but worth noting, both of these trend breaks are not principally algorithmic but hardware-related.
Maybe some level of evidence that future trend-breaking events might also be hardware related, which runs contrary to several projections.
Agreed, cf https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/
Thanks, I wasn’t aware of this post. (I think it overstates the level of spending we’ll see on the average RL env within a year by maybe 10x or more, but I agree directionally.)
In their BOTEC, it seems you roughly agree with a group size of 64 and 5 reuses per task (since 5 * 64 is between 100 and 1k).
You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 tokens * $15/1M tokens = $7.50 per rollout. 500,000 tokens doesn’t seem especially high for hard agentic software engineering tasks, which often reach into the millions of tokens.
Does the disagreement come from:
1. Thinking the $15 estimate from opportunity cost is too high (so compute cost lower than Mechanize claims)?
2. Expecting most of the RL training to somehow not be end-to-end (so compute cost lower than Mechanize claims)?
3. Expecting spending per RL environment to be smaller than compute spending, even if within an OOM?
I expect lower cost per rollout on average due to AI companies doing RL on a bunch of smaller tasks and from companies not necessarily using tons of reasoning tokens on most envs. Also, API prices are marked up relative to what companies actually pay on compute which can easily add a factor of 3. If we are just looking at the hardest agentic software engineering environments, then this closes the gap a decent amount.
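For concreteness, here is a sketch of the two per-rollout estimates being compared in this thread. The 500k tokens, the $15/1M price, and the ~3x API markup come from the comments above; the smaller average-rollout token count is a made-up illustrative number:

```python
# Mechanize-style estimate: a long rollout priced at API rates.
tokens_per_long_rollout = 500_000
api_price_per_million = 15.0  # dollars per 1M tokens
long_rollout_cost = tokens_per_long_rollout / 1e6 * api_price_per_million
print(f"long rollout at API prices: ${long_rollout_cost:.2f}")      # $7.50

# Adjustments from the reply: many environments are smaller / use fewer
# reasoning tokens, and internal compute cost sits well below API prices.
tokens_per_typical_rollout = 100_000  # illustrative assumption, not from the thread
api_markup_factor = 3.0               # API price vs. internal compute cost
typical_rollout_cost = (tokens_per_typical_rollout / 1e6
                        * api_price_per_million / api_markup_factor)
print(f"typical rollout, adjusted: ${typical_rollout_cost:.2f}")    # $0.50
```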
I expect spending on RL environments to be more like 10x lower than RL training compute rather than similar (and I wouldn’t be surprised by a large gap) because it’s hard to massively scale up spending on RL envs effectively in a short period of time, while we already have a scaled-up industrial process for buying more compute.
I’m more sympathetic to “companies will spend this much on some high quality RL envs” than “the typical RL env will be very expensive”, but I think some disagreement remains.
I think this analysis underestimates just how much compute OA and especially Anthropic currently have to spend on inference. Once they move to more efficient B200/300/Rubin systems, I expect a lot of compute to be freed up and for progress to accelerate.
I think the compute they spend on inference will also just get scaled up over time.
I agree with this especially for e.g. METR tasks, or proxies for how generally smart a model is.
A case for acceleration in enterprise revenue (rather than general smarts) could look like:
So far RL still has been pretty targeted towards coding, research/browsing, math, or being-ok-at-generic-tool-use (taking random MCP servers and making them into environments, like Kimi K2 did but with RL).
It takes SWE time to build custom interfaces for models to work with economically productive software like Excel, Salesforce, or Photoshop. We’re not there yet, at least with publicly released models. Once we are, this suddenly unlocks a massive amount of economic value.
Ultimately I don’t really buy this either, since we already have e.g. some Excel/Sheets integrations that are not great but better than what there was a couple months ago. And increase in breadth of RL environments is probably already factored into the trend somewhat.
ETA: this also matters less if you’re primarily tracking AI R&D capabilities (or it might still matter, but indirectly, through driving more investment etc.).
I’d frame the pace of RL environment progress with a simple 2×2.
1. Is the task bounded (Codeforces, IMO-style problems) or unbounded (financial analysis using Excel, executive communication using slides, coding in unstructured codebases, design work using Photoshop, etc.)?
2. Do we have in-house expertise (yes for coding, and easy to source for IMO) or not (OpenAI is hiring finance pros this week to help build evals for financial agents as I am writing this comment)? The presence of expertise helps companies build RL environments that better reflect the actual problem space.
That gives a rough order of progress:
- Bounded problem + know-how: o3 preview crushed Codeforces in Dec 2024.
- Unbounded problem + know-how: the Codex product line.
- Unbounded problem + limited know-how: ChatGPT agents still weak at spreadsheets & terrible at slides today, but I expect that to change in 6 to 12 months.
Not sure where Bounded problems with little know-how (e.g. Frontier Math) fall in this, though…
Yeah, revenue trends generally seem less robust regardless. (It doesn’t look like there is a consistent longer running trend except maybe over the last 2 years. I’d also expect revenue to be less stable in general.)