Fabien Roger comments on Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Fabien Roger 4 Jul 2025 16:51 UTC
2 points
Information requirements: The 71% success rate assumes attackers know which component blocked their request—information that some current systems provide but could be withheld.

For systems like the Anthropic constitutional classifiers, can you tell which classifier blocked your request based on whether the first few tokens get streamed to you? (This doesn’t let you build an output classifier jailbreak before you have succeeded at building an input classifier jailbreak, but I don’t see why that would be a big problem.)
If so, I’d be curious to see what happens if you apply your technique to Anthropic’s defenses. I have seen multiple people describe this hypothetical attack scenario (and some attempt it without success), so I would be much more convinced that this is a serious weakness if you actually made it work against a real production system (especially if you got an answer to the sort of questions that these defenses are centrally meant to guard against).
I think that if these attacks worked for real it would be a big deal, especially since I think this sort of “prefix jailbreak” probably degrades performance much less than encoding-based jailbreaks.
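
To make the question about streamed tokens concrete, here is a minimal sketch of the inference I have in mind. It assumes a two-stage pipeline (an input classifier that runs before generation and an output classifier that runs during streaming); the function name, the labels, and the `refused` flag are hypothetical and not taken from any real API. The zero-token case is only suggestive, since an output classifier that buffers could also stop a response before anything is streamed.

```python
from typing import List, Optional


def guess_blocking_stage(streamed_tokens: List[str], refused: bool) -> Optional[str]:
    """Guess which (hypothetical) stage stopped a response.

    Assumption: the input classifier runs before any token is generated,
    and the output classifier can only cut off a response mid-stream.
    """
    if not refused:
        # The response completed normally, so nothing was blocked.
        return None
    if len(streamed_tokens) == 0:
        # Nothing came back at all: most consistent with an input-side block,
        # though a buffering output classifier would look the same.
        return "input_classifier"
    # Some tokens arrived before the cutoff: the prompt got past the input
    # stage and was stopped during generation.
    return "output_classifier"
```

If a system returned the same refusal only after generation had finished or been buffered, both blocked cases would look identical and this signal would go away, which is the "could be withheld" point in the quoted passage.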