I have some trouble squaring this with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well curated RL environments that taught them to hyperpolate more effectively for coding tasks?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
These capabilities have emerged very quickly. Last month, we wrote that “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.[1]
Isn’t zero-day discovery the sort of process that is necessarily OOD?
In many cases, security bugs go unfound simply because not enough effort has been put into finding them. Given that, I think you could just as reasonably say that Mythos is becoming better at modeling the data distribution due to scale, and therefore ends up being better at finding these vulnerabilities.
On a related note, I’ve started to distrust Anthropic’s judgement on these things. In particular, they presented the C compiler experiment as being OOD, and I think that claim is false.
From the Jeremy Howard podcast link I shared:
So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it’s clean-room because it was created in Rust.” So, Chris created the, I guess it’s probably the top most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They’re like: “Chris didn’t use Rust. And we didn’t give it access to any compiler source code. So it’s a clean-room implementation.”
But that misunderstands how LLMs work. Right? Which is: all of Chris’s work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It’s a style transfer problem. So it’s definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It’s copied parts of the LLVM code, which today Chris says like, “oh, I made a mistake. I shouldn’t have done it that way. Nobody else does it that way.” Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn’t happen accidentally. That happens because you’re not actually being creative. You’re actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.
I have some trouble squaring this with the increasingly excellent OOD cyber capabilities of the leading models. Is the argument that their more generalized cyber skills (relative to some fuzzier domains, like alignment) are strong because they were subjected to well curated RL environments that taught them to hyperpolate more effectively for coding tasks?
Which OOD cyber capabilities? How do you know it’s OOD?
From Anthropic’s original assessment, the step change in Claude Mythos’s cybersecurity capabilities wasn’t just that it got much better at discovering existing bugs in software, but at creatively chaining them together into new exploits. Isn’t zero-day discovery the sort of process that is necessarily OOD?
All of that seems within-distribution to me.