Anyone consider themselves good enough at coding to assess whether this person’s dunks on the code quality of the leaked Claude Code are valid or whether they’re misunderstanding the purpose? I need something more substantive than “too Mastodon, didn’t read”.
Would also suffice to get links to what well-credentialed code experts currently think about the code quality of the leaked Claude Code.
The complaint about the code for image resizing seems valid and is the exact kind of problem that’s common in AI code (layering special cases on top of functions instead of stepping back to design a coherent system).
The rest of the complaints are about how the harness works, and I think they miss the point. Obviously, Anthropic would prefer to make Claude always do the right thing without assistance, but they can’t. So piling on hacks that check whether Claude did things, and that remind it of what it’s supposed to be doing, is the (formerly) secret sauce that makes Claude Code work how users want it to.
This reminds me of writing code to parse data from spreadsheets. You could assume that all of your users are robots who always write dates as UTC ISO 8601 timestamps, but then your product won’t work. The reality is that a “hacky” thousand line spreadsheet parser is better than one that assumes unrealistic behavior, and I think Claude Code is a similar case.
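To make the spreadsheet analogy concrete, here is a minimal sketch of the "hacky" approach: try a list of formats and accept the first one that parses. The format list and the name `parse_cell_date` are made up for illustration, and a real parser would need many more cases plus an explicit decision on ambiguous inputs like 03/04/2024.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats users actually type. Order matters:
# "03/04/2024" is ambiguous, and the first matching format wins.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",            # 2024-03-04
    "%Y-%m-%dT%H:%M:%SZ",  # 2024-03-04T10:30:00Z
    "%m/%d/%Y",            # 03/04/2024 (US month-first order is an assumption)
    "%d.%m.%Y",            # 04.03.2024
    "%B %d, %Y",           # March 4, 2024
    "%d %b %Y",            # 4 Mar 2024
]


def parse_cell_date(raw: str) -> Optional[date]:
    """Return the date for the first format that matches, or None."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    return None  # caller decides how to report an unparseable cell
```

The strict version is the same function with only the ISO entry in the list, which is exactly the one that rejects most real spreadsheets.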
(I’m only responding to the problems mentioned by that thread. It’s likely there are other problems in this codebase. Also, to the extent that some of the code is bad, they’re clearly taking that trade-off on purpose to get more speed, and that’s probably the right choice here.)
Senior SWE at Alphabet: the complaints read to me like stylistic nits, and not particularly good ones.
Ex:
1) As Zack says, the negative keyword regex is a very reasonable way to (extremely quickly & roughly) get a sense of negative sentiment. Not all sentiment analysis is load bearing, so doing something fast & cheap often makes sense.
2) Complaining about detailed comment explanations is a weird flex. If you are doing something unusual in your code, it is sometimes helpful to include a paragraph explaining why (otherwise later folks need to rederive its purpose).
3) He laughs at the instructions to not introduce security vulnerabilities (which list specific types). This is IMO a bad take. Reminding ppl (& LLMs) about common error patterns really does help avoid them.
Some of the code is not ideal (very little code in existence is), but the complaints in question IMO have a worse hit rate than if you asked your favorite LLM to critique the code.
The criticism of the negative keyword regex (“dogs you are LITERALLY RIDING ON A LANGUAGE MODEL what are you even DOING”) is way off-base. LLM queries are expensive! If all you want is to log, for QA reasons, whether the user is cussing at us, a regex is the right tool: it does the job without wasting tokens.
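For a sense of how cheap this kind of check is, here is a rough sketch. The word list and the name `looks_frustrated` are placeholders, not what the leaked code uses.

```python
import re

# Placeholder word list for illustration only.
NEGATIVE_WORDS = ["wtf", "useless", "broken", "terrible", "stupid"]

# One precompiled, case-insensitive pattern with word boundaries,
# so "unbroken" does not count as "broken".
NEGATIVE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NEGATIVE_WORDS)) + r")\b",
    re.IGNORECASE,
)


def looks_frustrated(message: str) -> bool:
    """Rough signal for logging; not meant to be accurate sentiment analysis."""
    return NEGATIVE_PATTERN.search(message) is not None
```

It will miss sarcasm and misfire on quoted text, which is fine if the result only feeds a QA metric and nothing depends on any single answer.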
I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS – this seems silly, but I’ve seen similar (shorter) messages in human codebases and they work.
“Be careful not to introduce security vulnerabilities such as...” – fine
Claude can be used for pentesting – fine
Environment variable for user type with elevated privileges – bad, but unfortunately common
Regex for swear words – seems fine, it’s cheaper than an LLM call, and not actually important enough to deserve one
Subagent to verify another agent’s run – actually good, and the author seems to be misunderstanding why it’s useful
Prompting the model to use a tool call – seems fine? My guess is that this was initially more hardcoded, and when the models got better they found it more effective to switch to an LLM call. And the prompt will likely result in the model debugging if something goes wrong, which is helpful
Long LLM comment – IMO a genuinely helpful comment
Reading the whole file instead of just validating the bytes – this genuinely seems inefficient and wasteful
Several cases of code duplication in slightly different styles – clunky and messy, and one of the big problems with LLM code
System reminder mechanism – seems pretty sketchy
Image example – clunky and messy
So I would say 5⁄12 of his comments point to real problems
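On the "reading the whole file" item: I don't know exactly what the code in question validates, but if the goal is something like checking a file's type, a sketch of the cheaper approach is to read only the first few bytes and compare them against known signatures. The function names here are hypothetical; the PNG, JPEG, and GIF signatures are the standard ones.

```python
from typing import Optional

# Standard leading byte signatures ("magic numbers") for common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]
HEADER_BYTES = max(len(sig) for sig, _ in SIGNATURES)


def sniff_header(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of the file, not its whole contents."""
    with open(path, "rb") as f:
        return sniff_header(f.read(HEADER_BYTES))
```

This only confirms that the file claims to be a given format. Anything that needs the decoded contents still has to read the rest.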
I mean, the tool works.
There had been omens and portents...
(Anthropic isn’t exactly suckless.org)
2 cents, not based on source code or these specific claims: It’s a terminal app written in JavaScript, which is often slow/flickery because it redraws the entire screen whenever something changes. Whereas Codex fast-followed & was very quickly rewritten in Rust, and its UX is noticeably snappier.
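To illustrate the redraw point in general terms (this says nothing about how either tool actually renders), here is a sketch of the usual alternative to repainting the whole screen: diff the previous frame against the new one and rewrite only the rows that changed. The function names are made up; the escape sequences are the standard ANSI cursor-position and erase-line codes.

```python
from typing import List, Tuple


def changed_lines(prev: List[str], curr: List[str]) -> List[Tuple[int, str]]:
    """Return (row, new_text) for every row that differs between two frames."""
    updates = []
    for row in range(max(len(prev), len(curr))):
        old = prev[row] if row < len(prev) else ""
        new = curr[row] if row < len(curr) else ""
        if old != new:
            updates.append((row, new))
    return updates


def render_updates(updates: List[Tuple[int, str]]) -> str:
    """Build the escape sequence that rewrites only the changed rows."""
    # ESC[<row>;1H moves the cursor (1-indexed); ESC[2K clears that line.
    return "".join(f"\x1b[{row + 1};1H\x1b[2K{text}" for row, text in updates)
```

A full redraw writes every row on every change. With diffing, a one-line status update touches one row, which is most of why it flickers less.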