Anecdotally, GPT-5 seems way above trend for real-world agentic coding.
I think METR’s constrained task-set overstated the capabilities of previous models, and the “real-world” performance of GPT-5 in e.g. Codex CLI seems much higher than with e.g. Claude Code.
Sonnet/Opus 4 were almost never able to test, debug, and fix end-to-end in our real codebase, whereas GPT-5 usually can.
I don’t work with RL, but I predict that if you created an RL environment with
1. GPT-5 (released <1 month ago) via Codex CLI
2. Claude Opus 4 (released 4 months ago) via Claude Code
you would see a dramatic difference in robustness/quality/functionality. Perhaps someone could test this; a rough sketch of one way to set up the comparison is below.
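Something like this minimal sketch, assuming both CLIs are installed and can be driven non-interactively (the `codex exec` / `claude -p` invocations are my best guess and will likely need model, sandbox, and approval flags for the installed versions), with a hand-built task list from the real codebase and a per-task check command standing in for a proper evaluation:

```python
import subprocess
import time

# Hypothetical task list: each entry pairs a prompt with a shell command that
# verifies success (e.g. part of the project's test suite). All placeholders.
TASKS = [
    {
        "prompt": "Fix the failing test in tests/test_billing.py and make the suite pass.",
        "check": ["pytest", "-q", "tests/test_billing.py"],
    },
    # ... more tasks, ideally drawn from real tickets ...
]

# Non-interactive invocations; exact flags and model selection will vary by version.
AGENTS = {
    "gpt-5 via codex cli": lambda prompt: ["codex", "exec", prompt],
    "opus 4 via claude code": lambda prompt: ["claude", "-p", prompt],
}


def run_task(build_cmd, task, timeout_s=1800):
    """Run one agent on one task in the current repo, then run the check command."""
    start = time.time()
    try:
        subprocess.run(build_cmd(task["prompt"]), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return {"passed": False, "seconds": timeout_s}
    check = subprocess.run(task["check"], capture_output=True)
    return {"passed": check.returncode == 0, "seconds": round(time.time() - start)}


if __name__ == "__main__":
    # In practice you'd reset the repo (e.g. `git checkout .`) between runs.
    for name, build_cmd in AGENTS.items():
        results = [run_task(build_cmd, t) for t in TASKS]
        passed = sum(r["passed"] for r in results)
        times = sorted(r["seconds"] for r in results)
        print(f"{name}: {passed}/{len(TASKS)} passed, median {times[len(times) // 2]}s")
```

Pass/fail plus wall-clock time wouldn’t capture robustness or code quality on their own; you’d want a rubric or human review on top, but even the raw numbers would be informative.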
I’m not sure what made GPT-5 seem so much better for agentic coding (plausibly a big RL scale-up plus other improvements?), but I do expect recent and upcoming advancements to drive an explosion in RL environment quality/quantity/robustness.
I think this might be a case where, for each codebase, there is a particular model that goes from “not reliable enough to be useful” to “reliable enough to sometimes be useful”. At my workplace, this first happened with Sonnet 3.6 (then called Claude Sonnet 3.5 New): there was what felt like a step change from 3.5 to 3.6. Earlier progress felt less impactful because incremental improvements still left the models unable to reliably handle the boilerplate, and later improvements felt less impactful because once a model can write the boilerplate, there isn’t really a lot of alpha in doing it better, and none of the models are reliable enough that we trust them to write the bits of core business logic where bugs or poor choices can cause subtle data integrity issues years down the line.
I suspect the same is true of e.g. trying to use LLMs to do major version upgrades of frameworks: a team may have a looming Django 4 → Django 5 migration, and try out every new model on that task. Once one of them is good enough, the upgrade will be done, and then further tasks will mostly be easier ones like minor version updates. So the most impressive task they’ve seen a model do will be that major version upgrade, and it will take some time for more difficult tasks that are still well-scoped, hard to do, and easy to verify to come up.
Hmm, I’ve heard many conflicting anecdotes here. My own experience is that GPT-5 is extremely bad at agentic coding compared with e.g. Opus 4.1 and even Sonnet 4. And that’s not even taking time into account: it uses something like 10-100x the time Sonnet does, which makes it mostly worthless to me.
For me it’s only been good at 1-turn stuff, similar to o3-pro (or at least my experience with o3-pro). Like, I’ll tell it to fix a bug, give it detailed info and context, and then let it run for a while, and it’s pretty good at fixing bugs that way. But if that doesn’t work, I’ll just revert all its changes and fix the bug myself. For multi-step stuff, like fixing a bug and then writing tests, or implementing module x and hooking it up to interface y, it just isn’t very good.
What app were you using? This sounds very similar to my experience using GPT-5 in Cursor.
Codex CLI is much much better—night and day difference.
I suppose this is good evidence that harness-specific RL was important for GPT-5.
This comports with my experience. GPT-5 is better at 1-shot builds, like “get a prototype of a web app that does X.” But it seems to have a harder time than Claude not breaking things when my requests target an existing large codebase, which is the majority of my work. For example, if I say “look through Y documentation, develop a plan for X change, and execute it”, Opus 4.1 tends to do this more reliably.
I think an interesting experiment would be to test different levels of specificity in prompts, across different sorts of codebases. My experience tells me that Claude is better at taking higher-level, less specific requests, developing an actionable plan that takes the codebase into account, and then executing that plan, at least for the data-engineering-type codebases I’m familiar with.
But this might not be so with, say, web development, or maybe even data engineering in different contexts. The models might be spiky in subtle ways, where specificity matters more in some contexts than in others.
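One way to run that, as a rough sketch: hold the underlying change fixed, vary only how much detail the prompt gives, and score every (model, codebase, specificity) cell the same way. The prompts, file names, and scoring below are invented placeholders, and the agent invocation is left as a stub:

```python
import subprocess

# Three prompts for the same hypothetical change, differing only in specificity.
SPECIFICITY_LEVELS = {
    "high-level": "Add retry handling to our ingestion pipeline.",
    "mid-level": ("Add retries with exponential backoff to the S3 loader in "
                  "pipeline/ingest.py; surface permanent failures to the caller."),
    "fully-specified": ("In pipeline/ingest.py, wrap S3Loader.fetch in a retry loop "
                        "(3 attempts, 1s/2s/4s backoff), re-raise on the final failure, "
                        "and add a unit test in tests/test_ingest.py."),
}


def run_agent(model: str, prompt: str, repo: str) -> None:
    """Stub: shell out to whichever coding agent is being tested, inside `repo`."""
    raise NotImplementedError


def score_result(repo: str) -> dict:
    """Minimal scoring: did the test suite pass? (Add a rubric review in practice.)"""
    check = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return {"tests_passed": check.returncode == 0}


def run_grid(models: list[str], repos: list[str]) -> list[dict]:
    results = []
    for model in models:
        for repo in repos:
            for level, prompt in SPECIFICITY_LEVELS.items():
                run_agent(model, prompt, repo)  # reset the repo between runs in practice
                results.append({"model": model, "repo": repo,
                                "specificity": level, **score_result(repo)})
    return results
```

If Claude really is better at filling in under-specified requests, its scores should degrade less as you move from the fully-specified prompt to the high-level one, and you could check whether that pattern holds across, say, data engineering vs. web codebases.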
What apps have you tried for this, and how recently?
Most of my usage is multi-turn in a 200k line codebase, for what it’s worth. It’s extremely rare that GPT-5 (via Codex CLI) breaks anything.
A mix of web apps + CLI tools, though admittedly I have a lot more usage on Claude Code than Codex CLI, so my perception is biased by using GPT-5 more through the chat interface and the Codex Web App.