Baybar’s Shortform

Baybar1 Oct 2025 21:02 UTC

2 points

7 comments1 min readLW link

Baybar 13 Nov 2025 19:38 UTC
66 points
16
Today’s news of the large scale, possibly state sponsored, cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5′s system card would have suggested this wasn’t possible yet. It described Sonnet 4.5s cyber capabilities like this:
We observed an increase in capability based on improved evaluation scores across the board, though this was to be expected given general improvements in coding capability and agentic, long-horizon reasoning. Claude Sonnet 4.5 still failed to solve the most difficult challenges, and qualitative feedback from red teamers suggested that the model was unable to conduct mostly-autonomous or advanced cyber operations.
I think it’s clear based on this news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report:
This campaign demonstrated unprecedented integration and autonomy of AI throughout the attack lifecycle, with the threat actor manipulating Claude Code to support reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, data analysis, and exfiltration operations largely autonomously. The human operator tasked instances of Claude Code to operate in groups as autonomous penetration testing orchestrators and agents, with the threat actor able to leverage AI to execute 80-90% of tactical operations independently at physically impossible request rates.
What’s even worse about this is that Sonnet 4.5 wasn’t even released at the time of the cyber attack. That means that this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large scale cyber attacks than whatever model did this, since it’s system card notes that it performs better on cyber attack evals than any previous Anthropic model.
I imagine when new models are released, we are going to continue to discover new capabilities of those models for months and maybe even years into the future, if this case is any guide. What’s especially concerning to me is that Anthropic’s team underestimated this dangerous capability in its system card. Increasingly, it is my expectation that system cards are understating capabilities, at least in some regards. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating my beliefs towards near-term jumps in AI capabilities being dangerous and harmful, since these jumps in capability could possibly go undetected at the time of model release.
- faul_sname 13 Nov 2025 23:46 UTC
  14 points
  0
  Parent
  
  Claude Sonnet 4.5 still failed to solve the most difficult challenges, and qualitative feedback from red teamers suggested that the model was unable to conduct mostly-autonomous or advanced cyber operations.
  
  I expect that it is technically true that Claude Sonnet 4.5 is not capable of doing advanced cyber operations, but being unable to do advanced cyber operations isn’t that important of a lack of capability if being able to do simple cyber operations is sufficient. And indeed
  
  The operational infrastructure relied overwhelmingly on open source penetration testing tools rather than custom malware development. Standard security utilities including network scanners, database exploitation frameworks, password crackers, and binary analysis suites comprised the core technical toolkit. These commodity tools were orchestrated through custom automation frameworks built around Model Context Protocol servers, enabling the framework’s AI agents to execute remote commands, coordinate multiple tools simultaneously, and maintain persistent operational state
  
  Running these tools is not difficult once you’ve learned your way around them, and learning your way around them is not very hard either. The fact that frontier LLMs aren’t at the level of top humans in this domain doesn’t actually buy us much safety, because the lowest-hanging fruit is hanging practically on the ground. In fact, I expect the roi on spear-phishing is even higher than the roi of competently running open source scanners, but “we caught people using Claude to find the names of the head of IT and some employees of companies and send emails impersonating the head of IT asking employees to compile and reply with a list of shared passwords” doesn’t sound nearly as impressive as “Claude can competently hack”. Even though the ability to write convincing spear-phishing messages is probably more threatening to actual security.
  
  For that matter, improving on existing open-source pentesting tools is likely also within the capability envelope of even o1 or Sonnet 3.5 with simple scaffolding (e.g. if you look at the open metasploit issues lots of them are very simple but not high enough value for a human to dedicate time to). But whether or not that capability exists doesn’t actually make all that much difference to the threat level, because again the low-hanging fruit is touching the ground.
- Fabien Roger 14 Nov 2025 0:49 UTC
  6 points
  0
  Parent
  The report said the attack was detected in mid September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection and it just doesn’t count as “mostly-autonomous”? It’s ambiguous and I agree the system card undersells Sonnet 4.5′s cyber abilities.
  The RSPv2.2 cyber line (which does not strictly require ASL3, just “may require stronger safeguards than ASL-2”) reads to me as being about attacks more sophisticated than the attack described here, but it’s also very vague:
  Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.
Baybar 1 Oct 2025 21:02 UTC
39 points
1
I’ve discovered that generating a video of yourself with Sora 2 saying something like ‘this video was generated by AI’ is sufficiently freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.
Thought this might be a useful idea for others trying to persuade people to tune in, and not just auto reject the idea that very capable systems might be right around the corner.
- Seth Herd 3 Oct 2025 16:20 UTC
  3 points
  0
  Parent
  Yes! Also, making a NotebookLM podcast about your own work is similarly startling to the uninitiated. They sound very human.
- samuelshadrach 2 Oct 2025 13:50 UTC
  3 points
  2
  Parent
  I want to prank video call people using my face and voice (and the text content itself) generated in realtime.
Baybar 23 Oct 2025 13:38 UTC
3 points
0
An AI company I’ve never heard of called AGI, Inc has a model called AGI-0 that has achieved 76.3% on OSWorld-verified. This would qualify as human-level computer use, at least by that benchmark. It appears on the official OSWorld-verified leaderboard. It does seem like they trained on the benchmark, which could explain some of this. I am curious to see someone test this model.
This is a large increase from the previous state of the art, which has been climbing rapidly since Claude Sonnet 4.5′s September 29th release. At that point, Claude achieved 61.4% on the OSWorld-verified. A scaffolded GPT-5 achieved even higher, 69.9%, on October 3rd. Now, on October 21st, AGI-0, seemingly a frontier computer use model, has outpaced them all, and surpassed the human benchmark in doing so.
AI-2027 projected a 65% on the OSWorld for August 2025. It predicted frontier models scoring 80% on the OSWorld privately in December 2025. It predicted models achieving this score would be available publicly in April 2026. This score on the OsWorld-verified is more than two thirds of the way to the 80% benchmark from the expected August capabilities. This is despite being less than a quarter of the way from August 2025 to an expected public release of a model with these capabilities. Assuming this isn’t just benchmark overfitting, the real world is even or ahead of AI-2027 on this computer usage benchmark.
Even more notably, AI-2027 projected this 80% benchmark would be met by “Agent 1”, their hypothetical leading AI agentic model at the end of 2025. It seems surprising that a frontier model from a new company would achieve something close to this without any of the main players’ (OpenAI, Anthropic, Google) models doing better than 61%. A lot to be curious and skeptical about here.
Update: it has been removed from the OSWorld-verified leaderboard, but they are still claiming to have done it and their results are downloadable.