Very preliminary opinion here, I’ve not yet spent enough time messing with it to be confident, but all these “Opus 4.5 in Claude Code can do anything!!!” experiences seem completely alien to mine. I can make Opus 4.5 sort of kind of implement not-entirely-trivial features if I do enough chewing-up and hand-holding and manual bug-reporting (its self-written tests are not sufficient). But it can’t autonomously code its way out of a wet paper bag.
And yes, I’ve been to Twitter, I’ve tried everything people have been suggesting. We designed a detailed tech specification, a solid architecture, and a step-by-step implementation plan with it beforehand, and I asked it to do test-driven development and to liberally use AskUserQuestionTool at me. I’ve also tried the opposite: starting with a minimal “user-facing features” spec and letting it take the wheel. The frustrated tone of this comment wasn’t a factor either; I’ve been aiming to convey myself in a clear and polite manner.[1] None of that worked, I detect basically no change since August.
My current guess is that we have a massive case of this happening. All the people raving about CCO4.5 being an AGI with no limits happen to be using it on some narrow suite of tasks,[2] and everyone else just thinks they have skill issues, so they sit quiet.
Or maybe I indeed have skill issues. We’ll see, I suppose. I’ll keep trying to figure out how to use it/collaborate with it.
I expect there’s indeed some way to wring utility out of LLMs for serious coding projects. But I’m also guessing that most of this frippery:
agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations
– will not end up very useful for that task.
(People often say that AI progress feels slow/stalled-out only if you’re not interacting with frontier LLMs in a technical capacity, i.e. if you’re operating off of months-outdated beliefs and chatbot conversations. It’s been the opposite for me: every time I take a break from heavier coding and only update based on other people’s experiences with newer models, I get psyop’d into believing that there’s indeed been a massive leap in AI capabilities and get concerned; then I go back, and I find my timelines growing to unprecedented lengths.)
> None of that worked, I detect basically no change since August.
What sort of codebase are you working on? I work in a 1-million-line TypeScript codebase, and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).
I wouldn’t say I can leave Opus 4.5 on a loose leash by any means, but unlike prior models, using AI agents for 80%-90% of my code modifications (as opposed to in-IDE with autocomplete) has actually become ROI positive for me.
The main game changer is that Opus has simply become smarter about working with large code bases—less hallucinated methods, more research into the codebase before actions are taken, etc.
As a simple example, I’ve had a “real project” benchmark for a while to convert ~2000 lines of test cases from an old framework to a new one. Opus 4.5 was able to pull it off with relatively minimal initial steering (showing it an example of a converted test case, correcting a few issues around laziness when it did the first 300-line set). Sonnet 4.5’s final state was a bit buggier, and more importantly, what it actually wrote during initial execution was considerably buggier, requiring it to self-correct when the typecheck or test cases failed. (Ultimately, Opus ended up costing about the same as Sonnet, with a third of the wall-clock time.)
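(To give a flavor of what that kind of conversion looks like, here’s a made-up before/after; the frameworks below are stand-ins I’m picking for the sketch, not the actual ones involved.)

```typescript
// Purely illustrative: an old callback-style test (sketched in the comment
// below) being converted into an async, expect-style test.
//
//   it("creates a user", function (done) {
//     createUser("alice", function (err, user) {
//       assert.equal(user.name, "alice");
//       done();
//     });
//   });
//
import { describe, it, expect } from "vitest";

// Stand-in for the code under test, so the sketch is self-contained.
async function createUser(name: string): Promise<{ name: string }> {
  return { name };
}

describe("users", () => {
  it("creates a user", async () => {
    const user = await createUser("alice");
    expect(user.name).toBe("alice");
  });
});
```

Multiply that by ~2000 lines of such cases and you get the benchmark.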
Most of my work is refactoring—in August, I would still have to do most of it manually, given the high error rate of LLMs. These days? Opus is incredibly reliable with only vague directions. As another recent example: I had to add a new parameter to a connection object’s constructor to indicate whether it should be read-only—Opus was able to readily update dozens of call sites correctly, based on whether each call site was using the connection to write.
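Sketching the shape of that change with invented names (not the real codebase, just to show what “update call sites based on whether they write” means):

```typescript
// Invented names for illustration only.
interface QueryRunner {
  query(sql: string): Promise<unknown>;
}

class Connection {
  constructor(
    private readonly runner: QueryRunner,
    private readonly readOnly: boolean, // the new constructor parameter
  ) {}

  read(sql: string) {
    return this.runner.query(sql);
  }

  write(sql: string) {
    if (this.readOnly) {
      throw new Error("write attempted on a read-only connection");
    }
    return this.runner.query(sql);
  }
}

// A call site that only reads should now pass readOnly = true...
async function fetchReports(runner: QueryRunner) {
  const conn = new Connection(runner, true);
  return conn.read("SELECT * FROM reports");
}

// ...while one that writes should pass readOnly = false.
async function recordReport(runner: QueryRunner) {
  const conn = new Connection(runner, false);
  return conn.write("INSERT INTO reports (id) VALUES (1)");
}
```

The model’s job was to make that true/false judgment correctly at each of the dozens of call sites.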
By no means does it feel like an employee (in the AI 2027 Agent-1 sense), but it is a powerful tool (getting more powerful with each generation) that has changed how I work.
I don’t think we necessarily disagree, here? I can imagine it being useful in this context, and I nod along at your “no loose leash” and “doesn’t feel like an employee, but a powerful tool” caveats.
> What sort of codebase are you working on?
This specifically was an attempt to vibe-code from scratch a note-management app that’s to my tastes; think Obsidian/Logseq, broken down into small modules/steps.
> As a simple example, I’ve had a “real project” benchmark for a while to convert ~2000 lines of test cases from an old framework to a new one.
My guess is that “convert this already-written code from this representation/framework/language/factorization to this other one” may be one of the things LLMs are decent at, yep! This is a direction I’m exploring right now: hack the codebase together using some language + factorization in which it’s easy (but perhaps unwise) to write, then try to use LLMs to “compile” it into a better format.
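A toy sketch of what I mean (invented example, not real project code): I’d hand-write something like the first version below, then ask the LLM to “compile” it into the second:

```typescript
// "Easy but unwise" version I'd hack together by hand:
// one mutable blob, stringly-typed fields.
const notes: any[] = [];
function addNote(title: string, body: string, tags: string) {
  notes.push({ title, body, tags: tags.split(","), created: Date.now() });
}

// The "compiled" target I'd want the LLM to rewrite it into:
// explicit types, no module-level mutable state.
interface Note {
  title: string;
  body: string;
  tags: string[];
  createdAt: Date;
}

function createNote(title: string, body: string, tags: string[]): Note {
  return { title, body, tags, createdAt: new Date() };
}

function addNoteTo(store: readonly Note[], note: Note): Note[] {
  return [...store, note];
}
```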
But this isn’t really “vibe-coding”/”describe the spec in natural language and watch the LLM implement it!”/”programming as a job is gone/dramatically transformed!”, the way it’s being advertised. LLMs are not, it seems, actually good at mapping natural-language descriptions into non-hack-y, robust background logic. You need a “code-level” prompt to specify the task precisely enough. And there’s only one way to bring that code into existence if LLMs can’t do it for you.
I agree with you that “Opus 4.5 can do anything” is overselling it, and there is too much hype around acting like these things are fully autonomous software architects. I did want to note, though, that Opus 4.5 is a vast improvement and that the praise is warranted.
> My guess is that “convert this already-written code from this representation/framework/language/factorization to this other one” may be one of the things LLMs are decent at, yep!
Agreed, I’m relying on their “localized” intelligence to get work done fast. Where Anthropic has improved their models significantly this year is A) task “planning”, e.g. how to extract the relevant context needed to make decisions the models could broadly already make, and B) editing code in sane ways that don’t break things (at the beginning of the year, Claude would chew up any 4000+ LOC file just from wrong tool use). In some ways, this isn’t necessarily higher “intelligence” (Claude models remain relatively dumber at solving novel problems compared to frontier GPT/Gemini) but rather proper training in the coding domain.
> But this isn’t really “vibe-coding”/”describe the spec in natural language and watch the LLM implement it!”/”programming as a job is gone/dramatically transformed!”, the way it’s being advertised. LLMs are not, it seems, actually good at mapping natural-language descriptions into non-hack-y, robust background logic. You need a “code-level” prompt to specify the task precisely enough.
It’s a mixed bag. In practice, I can vibe-code 100-line isolated modules from natural language, though it does require inspecting the code for bugs and then providing the model feedback so it can fix things. Still much faster than writing by hand, and slightly faster than “intention” auto-complete with Cursor.
But overall, yes, I agree: I continue to do all the systems architecture, and it feels like I’m offloading more well-defined tasks to the model.
[1] I suppose I haven’t tried the opposite on that matter. Maybe you do need to yell at LLMs for them to start working?
[2] E.g., maybe it’s:
- Various very “shallow” simple scripts for one-off data transformations/batch processing.
- Python ML libraries, which AGI labs have prioritized for obvious reasons.
- HTML + CSS and other simple frontend stuff.
- Specific science/math algorithms that are conceptually complex for a non-expert human, but are actually highly templated and memorizable.
- Various templated APIs, e.g. web stuff.
- Vibe-y conversations about architectures and implementations, with no actual code being pushed to production.
- Linear combinations of the above.