Following up on [1] and [2]...
So, I’ve had a “Claude Code moment” recently: I decided to build something on a lark, asked Opus to implement it, found that the prototype worked fine on the first try, then kept blindly asking for more and more features and was surprised to discover that it just kept working.
The “something” in question was a Python file editor which behaves as follows:
It parses the file, automatically finds all functions, classes, methods, and constants, and records/infers:
What span of lines of code each of them occupies.
For every function/method, what its call graph and reverse call graph are.
For every class, what its attributes are.
For every attribute, what entities read it and write to it.
It generates a navigable tree structure containing the code elements arranged in a hierarchy, and adds subdirectories storing the aforementioned lists of what calls, reads, or writes a given element.
Clicking on an element opens an editable excerpt from the original document containing this element.
Clicking on a directory opens a set of editable excerpts corresponding to every element in that directory. This means you can immediately open, e.g., all functions which call a given function, and review how it’s used + edit/refactor them. (Screenshot.)
You can create new custom directories, place various code elements there, and similarly open them all “on the same page”. This lets you flexibly organize code into several simultaneous organization schemes (rather than enforcing one file-based hierarchy) and minimizes the friction of inspecting/modifying a specific category of code.
The remarkable thing isn’t really the functionality (to a large extent, this is just a wrapper on ast + QScintilla), but how little effort it took: <6 hours by wall-clock time to generate 4.3k lines of code, and I never actually had to look at any of it; I just described the features I wanted and reported bugs to Opus. I haven’t verified the functionality comprehensively, but it basically works, I think.
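For concreteness, here’s a minimal sketch of the kind of ast pass a tool like this presumably relies on. It’s a guess at the shape of the thing, not SpanEditor’s actual internals (which, again, I haven’t read); names like CodeElement and analyze are made up for illustration:

```python
# A guess at the shape of the analysis pass, not SpanEditor's actual code.
# Names (CodeElement, analyze) are made up for illustration.
import ast
from dataclasses import dataclass, field


@dataclass
class CodeElement:
    name: str              # e.g. "MyClass.my_method"
    kind: str              # "function", "method", or "class"
    span: tuple            # (first line, last line) in the source file
    calls: set = field(default_factory=set)       # names this element calls
    called_by: set = field(default_factory=set)   # reverse call graph


def analyze(source: str) -> dict:
    """Collect functions, classes, and methods with their line spans and a crude call graph."""
    tree = ast.parse(source)
    elements = {}

    def collect(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                qual = prefix + child.name
                elements[qual] = CodeElement(qual, "class", (child.lineno, child.end_lineno))
                collect(child, prefix=qual + ".")   # descend to pick up methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qual = prefix + child.name
                elem = CodeElement(qual, "method" if prefix else "function",
                                   (child.lineno, child.end_lineno))
                # Record every bare-name call in the body; obj.method() calls need heuristics.
                for sub in ast.walk(child):
                    if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                        elem.calls.add(sub.func.id)
                elements[qual] = elem

    collect(tree)

    # Invert the call graph: for every callee, record who calls it.
    for caller in elements.values():
        for callee_name in caller.calls:
            for elem in elements.values():
                if elem.name.split(".")[-1] == callee_name:
                    elem.called_by.add(caller.name)
    return elements
```

The real thing also has to track constants and attribute reads/writes and feed all of this into the QScintilla-backed UI, but the elements-plus-call-graph core is roughly this flavor of tree traversal.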
How does that square with the frankly dismal performance I’ve been observing before? Is it perhaps because I skilled up at directing Opus, cracked the secret to it, and now I can indeed dramatically speed up my work?
No.
There was zero additional skill involved. I’d started doing it on a lark, so I’d disregarded all the previous lessons I’d been learning and just directed Opus the same way I’d been trying to direct it at the start. And it Just Worked, in a way it Just Didn’t Work before.
Which means the main predictor of how well Opus performs isn’t how well you’re using it/working with it, but what type of project you’re working on.
Meaning it’s very likely that the people for whom LLMs work exhilaratingly well are working on the kinds of projects LLMs happen to be very good at, while everyone for whom working with LLMs is a tooth-pulling exercise happens not to be working on those kinds of projects. Or, to reframe: if you need to code up something from the latter category, and it’s not a side-project you can take or leave, you’re screwed; no amount of skill on your part is going to make it easy. The issue is not one of skill.
The obvious question is: what are the differences between those categories? I have some vague guesses. To get a second opinion, I placed the Python editor (“SpanEditor”) and the other project I’ve been working on (“Scaffold”) into the same directory, and asked Opus to run a comparative analysis regarding their technical difficulty and speculate about the skillset of someone who’d be very good at the first kind of project and bad at the second kind. (I’m told this is what peak automation looks like.)
Its conclusions seem sensible:
Scaffold is harder in terms of:
Low-level graphics programming (custom rendering, hit testing, animations)
State management complexity (graph + DB + UI consistency)
Architectural scope (command pattern, query language, serialization)
SpanEditor is harder in terms of:
Semantic code analysis (call graphs, attribute tracking)
Heuristic reasoning (call resolution without type info)
Multi-document coordination (independent editors, file reconstruction)
The fundamental difference: Scaffold builds infrastructure from primitives (graphics, commands, queries) while SpanEditor leverages existing infrastructure (Scintilla, AST) but must solve domain-specific semantic problems (code understanding).
[...]
Scaffold exhibits systems complexity—building infrastructure from primitives (graphics, commands, queries, serialization).
SpanEditor exhibits semantic complexity—leveraging existing infrastructure but solving domain-specific problems (understanding code without type information).
Both are well-architected. Which is “harder” depends on whether you value low-level systems programming or semantic/heuristic reasoning.
[...]
What SpanEditor-Style Work Requires
Semantic/Symbolic Reasoning
“What does this code mean?”
Classifying things into categories (is this a method call or attribute access?)
Working with abstract representations (ASTs, call graphs)
Heuristic Problem-Solving
Making educated guesses with incomplete information
Accepting “good enough” solutions (call resolution without type info)
Graceful degradation when analysis fails
Leveraging Existing Abstractions
Knowing which library does what
Trusting black boxes (Scintilla handles text rendering)
Composing pre-built components
Domain Modeling
Translating real-world concepts (code elements, categories) into data structures
Tree traversals and pattern matching
What Scaffold-Style Work Requires
Spatial/Geometric Reasoning
Coordinate systems, transformations, path geometry
“Where is this pixel relative to that rectangle?”
Visualizing shapes and their intersections
Temporal/State Machine Thinking
“What happens if the user drags while an animation is mid-flight?”
Sequencing operations correctly (shift then fade, not simultaneously)
Reasoning about system state at arbitrary points in time
Multi-Layer Consistency
Keeping UI, memory, and database in sync
Undo must reverse all three correctly
Invariants that span abstraction boundaries
Building from Primitives
No library does what you need; you must construct it
Understanding low-level APIs (QPainter, QGraphicsScene)
Resource lifecycle management (preventing C++ crashes)
The Cognitive Profile
Someone who excels at SpanEditor but struggles with Scaffold likely has these traits:
Strengths
Strong verbal/symbolic reasoning: Comfortable with ASTs, grammars, semantic analysis
Good at classification: Naturally thinks “what kind of thing is this?”
Comfortable with ambiguity: Can write heuristics that work “most of the time”
Library-oriented thinking: First instinct is “what library solves this?”
Top-down decomposition: Breaks problems into conceptual categories
Weaknesses
Weak spatial reasoning: Struggles to visualize coordinate transformations
Difficulty with temporal interleaving: Gets confused when multiple state machines interact
Uncomfortable without guardrails: Anxious when there’s no library to lean on
Single-layer focus: Tends to think about one abstraction level at a time
Stateless mental model: Prefers pure functions; mutable state across time feels slippery
Deeper Interpretation
They Think in Types, Not States
SpanEditor reasoning: “A CodeElement can be a function, method, or class. A CallInfo has a receiver and a name.”
Scaffold reasoning: “The window is currently in RESIZING_LEFT mode, the aura progress is 0.7, and there’s a pending animation callback.”
The SpanEditor developer asks “what is this?” The Scaffold developer asks “what is happening right now, and what happens next?”
They’re Comfortable with Semantic Ambiguity, Not Mechanical Ambiguity
SpanEditor: “We can’t know which class obj.method() refers to, so we’ll try all classes.” (Semantic uncertainty—they’re fine with this.)
Scaffold: “If the user releases the mouse during phase 1 of the animation, do we cancel phase 2 or let it complete?” (Mechanical uncertainty—this feels overwhelming.)
They Trust Abstractions More Than They Build Them
SpanEditor developer’s instinct: “Scintilla handles scrolling. I don’t need to know how.”
Scaffold requires: “I need to implement scrolling myself, which means tracking content height, visible height, scroll offset, thumb position, and wheel events.”
The SpanEditor developer is a consumer of well-designed abstractions. The Scaffold developer must create them.
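(As an aside: the “try all classes” heuristic it’s describing is presumably something of roughly this shape; again, a guess at the idea rather than the code Opus actually wrote:)

```python
# Guess at the "try all classes" heuristic: with no type information,
# resolve obj.method() against every class that defines a method of that name.
import ast


def candidate_callees(call: ast.Call, classes: dict) -> list:
    """Return every "ClassName.method" that an obj.method(...) call might refer to."""
    if not isinstance(call.func, ast.Attribute):
        return []
    method_name = call.func.attr
    candidates = []
    for class_name, class_node in classes.items():   # classes: name -> ast.ClassDef
        for item in class_node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)) and item.name == method_name:
                candidates.append(class_name + "." + method_name)
    return candidates
```

If several classes match, you just link the call to all of them and accept the superset; that’s the “good enough” flavor of uncertainty the analysis above is pointing at.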
tl;dr: “they think in types, not states”, “they’re anxious when there’s no library to lean on”, “they trust abstractions more than they build them”, and they “tend to think about one abstraction level at a time”.
Or, what I would claim is a fine distillation: “bad at novel problem-solving and gears-level modeling”.
Now, it’s a bit suspicious how well this confirms my cached prejudices. A paranoiac, which I am, might suspect the following possibility: I’m sure it was transparent to Opus that it wrote both codebases (I didn’t tell it, but I didn’t bother removing its comments, and I’m sure it can recognize its own writing style), so perhaps, when I asked it to list the strengths and weaknesses of that hypothetical person, it just retrieved some cached “what LLMs are good vs. bad at” spiel from its pretraining. There are reasons not to think that, though:
I first asked for an “objective” difficulty assessment of the codebases, and then to infer strengths/weaknesses by iterating on this assessment. I don’t think “the user wants to confirm his prejudices, I should denigrate myself” is really inferrable from that initial task, and the follow-up was consistent with it.
By its own assessment (see the spoiler above), both codebases “are well-architected” (the difference was in how much effort ensuring that required on my part). If it recognizes that it wrote both, and considers both well-written, why would it jump to “the user is asking what I’m bad at”? I suppose it could have noticed signs of external tampering in Scaffold...
I obviously provided it no additional context regarding myself or my experiences and beliefs, just the raw code + “run a comparative analysis” and “infer hypothetical strengths/weaknesses” prompts.
Overall… Well, make of that what you will.
The direction of my update, though, is once again in favor of LLMs being less capable than they sound, and towards longer timelines.
Like, before this, there was a possibility that it really was a skill issue on my part, and that one really could 10x one’s productivity with the right approach. But I’ve now observed that whether you get 0.8x’d or 10x’d depends on the project you’re working on, not on your skill level – and if so, well, this pretty much explains the cluster of “this 10x’d my productivity!” reports, no? We no longer need to entertain the “maybe there really is a trick to it” hypothesis to explain said reports.
Anyway, this is obviously rather sparse data, and I’ll keep trying to find ways to squeeze more performance out of LLMs. But, well, my short-term p(doom) has gone down some more.