Anthropic has not demonstrated that their safeguards work reliably enough in practice, and thus I am not very sympathetic to their perspective on the merits of the case (although procedurally this is obviously bad from the Trump admin). It seems good if the result of this is that the burden of proof of safety shifts towards the companies.
Yes, external testing has not found a universal jailbreak according to the model report, but it’s pretty unclear that the focus on universal jailbreaks is sufficient. No external party has tested Anthropic’s models (or other labs’, for that matter) AND their defense-in-depth approach to explicitly certify that they work sufficiently well; instead, it’s isolated fishing for universal jailbreaks, and otherwise taking Anthropic at their word. I don’t think a convincing case has been made that universal jailbreaks are necessary to cause harm with a model. In fact, it seems very possible that an attacker could use a custom jailbreak that works for a given environment and/or pivot to different jailbreaks when one stops working.
To me, this seems like potentially pretty misleading (and somewhat silly) communication from Anthropic.
Amazon claims that they used a technique that successfully bypassed Fable’s safeguards to get a response from Mythos, which IIUC Anthropic does not dispute. For the question of whether the safeguards work, it is probably immaterial whether the response is something you could have gotten from other models. The value of this demonstration is that it shows that the safeguards did not work, not that any harm could have been or was done in this specific case. Unless Anthropic makes the imho very implausible claim that the safeguards are developed in a way that lets them detect whether a particular vulnerability is previously known (or minor), this technique is evidence that an attacker totally could have used it for vulnerabilities that other models could not find.