My current belief is that the state of jailbreaking is “basically all models can be jailbroken with roughly 10 hours of effort”[1]. I do think this maybe satisfies the ASL-3 commitments made, but it seems 75%+ likely that jailbreaking will continue to be at most a moderate obstacle. In other words, anyone serious about building biological weapons currently is at most minorly inconvenienced by the need for jailbreaks (10 hours is an extremely small fraction of the time spent on executing any such project).
The tweet you link to here is targeting a chemical weapon, not a biological weapon, which was explicitly out of scope per footnote 3 on the post on ASL-3 protections at the time. I don’t think sarin synthesis instructions are much evidence either way here.
I would guess that getting similarly helpful uplift for, e.g., weaponizing dangerous pathogens would be significantly harder, especially with the classifiers of 2026 which are a lot better, but I agree the universal jailbreak focus isn’t that well justified for a threat model where misuse actors have in-house jailbreaking effort. Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
Would be pretty interested if you know of evidence that people can get single-topic bio jailbreaks past current classifiers without extremely large amounts of effort!
I thought the above was one such example (and I looked in all the places where it was posted to find someone saying something like that), so maybe I am wrong!
Possible that I am wrong about this. I’ll think a bit more in the following days.
Drake is right, sorry for the confusion. We were not intentionally misleading – we missed a footnote on the announcement when putting together the initial tweet thread that narrowed the claim to just bio rather than CBRN, as discussed in the rest of the announcement. We did later find a vulnerability that allowed us to bypass the filters in the bio setting, reported it, and it was patched. I think that follow-up work took more on the order of 40 person-hours, but was a general method that could extract information in a range of settings. I don’t know how likely it is that there are further such vulnerabilities.
Even if single-query jailbreaking was O(10) hours though, having to send many queries to discover that jailbreak makes it much easier to catch through monitoring.
Do query-specific jailbreaks require knowledge of the ground truth? If you use that high-profile jailbreak which generated a plausible-sounding sarin recipe[1] (which, iiuc, worked by asking the model to help the user write a short story/article containing the diary of a chemical terrorist, or something like that), do you actually get a workable sarin recipe, or do you get a procedure which actually fails/explodes/gets you caught easily because you tripped every single government watchlist filter at the same time?[2] What if you get your jailbreak slightly wrong? In the process of getting the right answer, will you get half a dozen fake answers that you can’t distinguish between?
Some of it seems silly to me just reading it: weeks 1-2? For a chlorination step in an ice bath? That’s like, an afternoon of work, max.
I’m pretty sure that you can’t “just order” the relevant chemicals for sarin, in particular.
This was Pliny’s response when I asked them if they can get around the classifiers. I’m not fully confident this counts, but Pliny seems to think so.