This comports with my experience. GPT-5 is better at one-shot builds, like "get a prototype of a web app that does X." But it has a harder time than Claude avoiding breakage when my requests target an existing large codebase, which is the majority of my work. For example, if I say "look through Y documentation, develop a plan for X change, and execute it," Opus 4.1 tends to do this more reliably.
I think an interesting experiment would be to test different levels of prompt specificity across different sorts of codebases. My experience is that Claude is better at taking a higher-level, less specific request, developing an actionable plan that takes the codebase into account, and then executing that plan, at least in the data-engineering codebases I'm familiar with.
But this might not hold for, say, web development, or even for data engineering in different contexts. The models might be spiky in subtle ways, with specificity mattering more in some contexts than others.
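If someone wanted to actually run that experiment, a minimal harness might look like the sketch below. Everything here is illustrative: the task prompts, the repo path, and the agent invocations (`claude -p`, `codex exec`) are assumptions, and "does the test suite still pass afterwards" is only a crude proxy for "didn't break anything."

```python
import subprocess
from itertools import product

# One hypothetical task phrased at three levels of specificity.
PROMPTS = {
    "high": "Improve the reliability of the nightly ETL job.",
    "mid": "Add retry-with-backoff to the API extraction step of the nightly ETL job.",
    "low": ("In etl/extract.py, wrap fetch_page() in a retry loop: "
            "3 attempts, exponential backoff starting at 2s, re-raise on final failure."),
}

# Non-interactive agent invocations -- these flags are assumptions;
# substitute whatever your versions of the CLIs actually accept.
AGENTS = {
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
}

def run_trial(agent_cmd, prompt, repo_dir):
    """Run one agent on one prompt in a scratch checkout, then check the tests."""
    agent = subprocess.run(agent_cmd + [prompt], cwd=repo_dir,
                           capture_output=True, text=True, timeout=1800)
    # Crude proxy for "didn't break anything": does the suite still pass?
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                           capture_output=True, text=True)
    return {"agent_exit": agent.returncode, "tests_pass": tests.returncode == 0}

if __name__ == "__main__":
    for (name, cmd), (level, prompt) in product(AGENTS.items(), PROMPTS.items()):
        print(name, level, run_trial(cmd, prompt, repo_dir="./scratch-checkout"))
```

Run each cell of the grid several times on a fresh checkout, and the spikiness (if it exists) should show up as an interaction between model and specificity level rather than a uniform gap.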
What apps have you tried for this, and how recently?
Most of my usage is multi-turn in a 200k line codebase, for what it’s worth. It’s extremely rare that GPT-5 (via Codex CLI) breaks anything.
A mix of web apps and CLI tools, though admittedly I have a lot more usage on Claude Code than Codex CLI, so my perception is biased by using GPT-5 more through chat and the Codex Web App.