Small models also found the vulnerabilities that Mythos found
We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos’s flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug.
I’ve been more skeptical than the average reader/commenter here about the capabilities of Mythos et al., and I also have some limited security experience.
It seems to me more surprising that human researchers didn’t discover these exploits, rather than that Mythos/Opus did.
Also, vulnerability research is a very wide field, and as they say, “there are levels to this game”.
Overall, I think that Mythos is probably more capable at cybersecurity, but I don’t share the vibe that it’s a “god in a box”, or other such monikers I’ve seen online.
Zvi covered this in a recent roundup. The gist: the smaller models don’t have Mythos’s ability to chain vulnerabilities and write effective exploits, and they have a much higher false positive rate. Their output would therefore need extensive skilled manual effort to achieve anything near the same result. In contrast, Mythos’s output is directly useful to an attacker … or, fortunately, to a defender.
See also this discussion on HN.
Where is this quotation from? I don’t see a link.
Also: “isolated the relevant code”. That phrase could be doing a lot of heavy lifting here, right? It’s one thing to sift through a million lines of code and identify a bug. It’s another to be handed three lines that contain a bug and find it. Needle in a haystack vs. needle with two pieces of straw. If the methodologies were identical, okay. But I’d like to see a side-by-side comparison of methodology here.
I googled the source link here: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
I’m also concerned about isolating the code. It’s the difference between finding a needle in a haystack and distinguishing a needle from a single straw. Their set of models returned 12/18 false positives (and 18/18 true positives), which suggests terrible specificity to me.
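Taking those counts at face value — 18 real findings plus 12 spurious ones across the runs, which is my reading of the numbers, not the post’s own framing — the back-of-envelope precision works out like this:

```python
# Back-of-envelope precision from the counts quoted above.
# Assumption (mine, not the post's): 18 confirmed findings and
# 12 spurious findings, treated as one pooled set of reports.
true_positives = 18
false_positives = 12

# Precision: what fraction of reported findings are real bugs.
precision = true_positives / (true_positives + false_positives)
print(f"precision = {precision:.2f}")  # → precision = 0.60
```

Strictly speaking, specificity would require knowing how many clean code samples the models correctly passed over, and that denominator isn’t reported — precision is the closest metric we can actually compute from these numbers, and 0.60 still means a triager wades through four junk reports for every six real ones.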