On the other hand, eliminating hidden coupling (pieces of code that must be changed in tandem, yet produce no compile errors when they fall out of sync; taught in Unit 2 of our course) is becoming more important. The ability of code to become intertwined far outpaces the models’ ability to read and think through large codebases, and they lack even the feeble human memory to track such secret dependencies. Though I and many others have experienced nasty bugs caused by AI-induced hidden coupling, this is something that I (knock on wood) do not see changing at the model level. Our refactoring agent improves a lot, but as hidden coupling is fundamentally about understanding design intent, preventing it still needs a knowledgeable human in the driver’s seat, one using Command Center’s understandability and walkthrough features more than the refactoring features.
The upshot of all this: when one of our contractors moved to a company with lower code quality, he found that the effectiveness of his Claude Code dropped tremendously. And we’ve been working on a set of benchmarks that measure code quality by the amount of work needed and the number of bugs produced by AIs doing follow-on work.
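To make the idea of hidden coupling concrete, here is a minimal sketch in Python. It is not taken from the newsletter or the course; every name in it (`encode_user`, `decode_user`, `FIELDS`, and so on) is hypothetical and chosen only for illustration. The first pair of functions silently agree on a field order, so changing one without the other raises no error and simply yields wrong data. The second pair shows one possible way to make that dependency explicit.

```python
# Hypothetical illustration of hidden coupling; all names are made up.

# Hidden coupling: the encoder and decoder both assume "name" comes first,
# but nothing in the code states or enforces that shared assumption.
def encode_user(name: str, email: str) -> str:
    return f"{name}|{email}"


def decode_user(line: str) -> dict:
    parts = line.split("|")
    return {"name": parts[0], "email": parts[1]}


# A later edit reorders the fields in the encoder only. No error is raised
# anywhere; decode_user just returns the values under the wrong keys.
def encode_user_v2(name: str, email: str) -> str:
    return f"{email}|{name}"


# One way to remove the hidden coupling: a single shared definition of the
# field order that both directions read from.
FIELDS = ("name", "email")


def encode(record: dict) -> str:
    return "|".join(record[field] for field in FIELDS)


def decode(line: str) -> dict:
    return dict(zip(FIELDS, line.split("|")))
```

With the shared `FIELDS` tuple, reordering or adding a field is a one-place change, and the round trip `decode(encode(record))` stays consistent. This only addresses the mechanical side; deciding which pieces of code ought to share a definition still depends on design intent.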
I’d be interested in more such code-quality-oriented benchmarks. For now, the only criteria I’m aware of from inside Anthropic, at least that they’ve mentioned publicly, are “it works” and “another engineer can understand and build upon it”:
The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.
How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.
On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.
The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.
You reminded me of James Koppel’s latest newsletter: