Today’s news of the large-scale, possibly state-sponsored cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5’s system card would have suggested this wasn’t possible yet. It described Sonnet 4.5’s cyber capabilities like this:
We observed an increase in capability based on improved evaluation scores across the
board, though this was to be expected given general improvements in coding capability and
agentic, long-horizon reasoning. Claude Sonnet 4.5 still failed to solve the most difficult
challenges, and qualitative feedback from red teamers suggested that the model was unable
to conduct mostly-autonomous or advanced cyber operations.
I think it’s clear from the news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report:
This campaign demonstrated unprecedented integration and autonomy of AI throughout
the attack lifecycle, with the threat actor manipulating Claude Code to support
reconnaissance, vulnerability discovery, exploitation, lateral movement, credential
harvesting, data analysis, and exfiltration operations largely autonomously. The human
operator tasked instances of Claude Code to operate in groups as autonomous penetration
testing orchestrators and agents, with the threat actor able to leverage AI to execute
80-90% of tactical operations independently at physically impossible request rates.
What’s even worse is that Sonnet 4.5 wasn’t even released at the time of the cyber attack. That means this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large-scale cyber attacks than whatever model did this, since its system card notes that it performs better on cyber attack evals than any previous Anthropic model.
If this case is any guide, we are going to keep discovering new capabilities of models for months and maybe even years after they are released. What’s especially concerning to me is that Anthropic’s team underestimated this dangerous capability in its system card. Increasingly, my expectation is that system cards understate capabilities, at least in some respects. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating toward near-term jumps in AI capability being dangerous and harmful, since these jumps could go undetected at the time of model release.
Claude Sonnet 4.5 still failed to solve the most difficult challenges, and qualitative feedback from red teamers suggested that the model was unable to conduct mostly-autonomous or advanced cyber operations.
I expect it is technically true that Claude Sonnet 4.5 is not capable of advanced cyber operations, but that lack of capability isn’t very important if simple cyber operations are sufficient. And indeed:
The operational infrastructure relied overwhelmingly on open source penetration testing tools rather than custom malware development. Standard security utilities including network scanners, database exploitation frameworks, password crackers, and binary analysis suites comprised the core technical toolkit. These commodity tools were orchestrated through custom automation frameworks built around Model Context Protocol servers, enabling the framework’s AI agents to execute remote commands, coordinate multiple tools simultaneously, and maintain persistent operational state
Running these tools is not difficult once you’ve learned your way around them, and learning your way around them is not very hard either. The fact that frontier LLMs aren’t at the level of top humans in this domain doesn’t actually buy us much safety, because the lowest-hanging fruit is hanging practically on the ground. In fact, I expect the ROI on spear-phishing is even higher than the ROI of competently running open-source scanners, but “we caught people using Claude to find the names of the head of IT and some employees of companies, and send emails impersonating the head of IT asking employees to compile and reply with a list of shared passwords” doesn’t sound nearly as impressive as “Claude can competently hack,” even though the ability to write convincing spear-phishing messages is probably more threatening to actual security.
For that matter, improving on existing open-source pentesting tools is likely also within the capability envelope of even o1 or Sonnet 3.5 with simple scaffolding (e.g., if you look at the open Metasploit issues, lots of them are very simple but not high enough value for a human to dedicate time to). But whether or not that capability exists doesn’t actually make much difference to the threat level, because again the low-hanging fruit is touching the ground.
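To make concrete how little machinery the report's "custom automation frameworks" pattern actually requires: the skeleton is just an agent loop that shells out to off-the-shelf tools and persists its state between invocations. This is a minimal, deliberately benign sketch; the tool names and commands are placeholders of my own, not anything from the report.

```python
# Hypothetical sketch of the orchestration pattern described in the report:
# an agent loop that drives external command-line tools and keeps
# persistent operational state across runs. All tool commands here are
# harmless placeholders (the "tool" just reports the interpreter version).
import json
import subprocess
import sys
from pathlib import Path

STATE_FILE = Path("op_state.json")

def load_state() -> dict:
    # "Persistent operational state": resume where the last run left off.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed": [], "findings": {}}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def run_tool(cmd: list[str]) -> str:
    # In the described campaign this role was played by MCP tool calls
    # wrapping commodity scanners; here it is a plain subprocess call.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip()

def agent_step(state: dict) -> None:
    # A trivial "plan": run each not-yet-completed tool once and record
    # its output as a finding.
    tools = {"version_check": [sys.executable, "--version"]}
    for name, cmd in tools.items():
        if name not in state["completed"]:
            state["findings"][name] = run_tool(cmd)
            state["completed"].append(name)

state = load_state()
agent_step(state)
save_state(state)
print(state["completed"])
```

The point of the sketch is that nothing here is sophisticated: the hard parts are all inside the commodity tools, and the coordination layer is a few dozen lines of glue.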
The report said the attack was detected in mid-September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection, and the attack just doesn’t count as “mostly-autonomous”? It’s ambiguous, and I agree the system card undersells Sonnet 4.5’s cyber abilities.
The RSP v2.2 cyber line (which does not strictly require ASL-3, just “may require stronger safeguards than ASL-2”) reads to me as being about attacks more sophisticated than the one described here, but it’s also very vague:
Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.