“Solving jailbreaks” implies a world where you can totally prevent jailbreaks. However, it seems to me that knowing how to do this could require a deep understanding of jailbreaks that could also be reversed to easily construct jailbreaks.
this is not obviously true in general. building provably correct software does not, for instance, automatically give you the ability to easily construct hacks of said software. why would jailbreaks be different?
AFAIK the main way new jailbreaks and defenses are constructed is by iteratively testing approaches and understanding their merits and drawbacks. But the attacker is favoured, since they only need to find that a single jailbreak exists, whereas the defender must guard against worst-case behaviour.
I feel like the software example is very disanalogous. E.g. we could in principle understand the full inner anatomy of a software program via crisp abstractions, which is what makes verification possible. But NNs can't be understood in the same way because their complexity is emergent, living at a higher level of abstraction than the raw operations. And generally the messiness of NN internals precludes knowing what the relevant abstractions even are.
Maybe you're more bullish on solving ambitious mech interp? I agree that would probably enable solving jailbreaks. It would also enable solving many other things.
why is the attacker favored? this argument proves too much. forget provably correct software for a moment: normal software vulnerabilities are mostly fixed via whack-a-mole as attackers discover bugs, and yet most software in the world clearly doesn't get hacked most of the time, suggesting the attacker asymmetry can't be that big. most security researchers can write a piece of code that they themselves cannot hack, even given many times more time to attack it than was spent designing it.
It feels like with software it's much more obvious what the defense guarantees are? Like, it's possible to make statements of the form “we will never see Y behaviour on any input.” But with NNs it's hard to make such statements.
Also, I contend that your analogy is wrong. Getting hacked seems pretty common (admittedly a vibes-based assessment). Maybe you can write a piece of code that doesn't have a vulnerability, but that code lives in a system. The system is usually complex, with many places where vulnerabilities can hide, and empirically these don't all get covered before software is released (otherwise zero-day exploits wouldn't exist).
I think my basic argument above is that sufficiently complex systems don't lend themselves to systematic, rigorous analysis, and the lack of such analysis makes it hard to understand or improve worst-case guarantees.
Some other points:
- LLMs are expected to be general purpose, while software is typically single purpose. The greater the diversity of use cases, the harder it is to secure all of them.
- As a corollary of the above, more people have an incentive to jailbreak an LLM.
- Jailbreaking an LLM is simply more accessible, so more people try it.
- It seems easier for malicious third parties to make LLMs more jailbreakable, e.g. by poisoning internet training data with Pliny-style backdoors.