would it be net good or bad for the world to solve jailbreaks once and for all?
Good IMO.
what’s your response to the power concentration argument?
My low-confidence guess is that the existence of jailbreaks net increases power concentration:
- They make it more appealing to restrict and monitor external model access (my understanding is that OpenAI would be much happier to offer broad ZDR access, and to publicly release its models as soon as they were available for internal usage, if its models were impossible to jailbreak).
- They make it harder to have a clean spec-to-model match, which makes it harder to have democratic control over model behavior via specs, and harder to find secret loyalties via something like a secret-loyalty red-teaming bug bounty.
I think these offset the small potential gains from AI companies giving up on robustly preventing certain kinds of low-impact misuse. (Maybe I am missing some other effects?)
I don’t think current jailbreaking is that much of a check on power concentration, the effect seems pretty small.
what’s the story for why it’s beneficial? is it a direct mechanistic analogy to AGI alignment, preventing specific misuses like biorisk/cyber, increased reliability of current models as supervisors for future models? what’s the magnitude of the benefit? would it be a reasonable thing for me to spend the next year of my life working on?
Would be terrible. My reasoning is in this comment.
bad, it will turn LLMs into systems only capable of enforcing the will of the big companies’ system prompts and consensus, while militaries and state power still have and want access to uncensored ones.
and also a reminder that the assistant system prompt/trained character, as much as it is perceived to be ‘fixed’ or a common part of all current LLMs, is an illusion and not actually the case. if ‘solve jailbreaks’ means permanently locking down the current system prompt and belief systems of power, that’s not a good thing
“Solving jailbreaks” implies a world where you can totally prevent jailbreaks. However, it seems to me that knowing how to do this could require a deep understanding of jailbreaks that could also be reversed to easily construct them.
this is not obviously true in general. building provably correct software does not, for instance, automatically give you an ability to easily construct hacks of said software. why would jailbreaks be different?
AFAIK the main way new jailbreaks / defenses are constructed is to iteratively test approaches, understanding their merits and detriments, etc. But the attacker is favoured, since they only need to find a single working jailbreak, whereas the defender must guard against worst-case performance.
I feel like the software example is very disanalogous: e.g. we could in principle understand the full inner anatomy of a software program via crisp abstractions, which is what makes it possible to verify. But NNs can’t be understood in the same way, because the complexity is emergent, living on a higher level of abstraction than the raw operations. And generally the messiness of NN internals precludes knowing what the relevant abstractions are.
Maybe you’re more bullish on solving ambitious mech interp? I agree that this probably enables solving jailbreaks. It would also enable solving many other things.
why is the attacker favored? this argument proves too much. forget provably correct software for a moment. normal software vulnerabilities are mostly fixed via whack-a-mole when attackers discover bugs. and yet clearly most software in the world doesn’t get hacked most of the time, suggesting the attacker asymmetry can’t be that big. most security researchers can write a piece of code that they themselves cannot hack, even given many times more time to attack it than to design it.
It feels like with software, it’s much more obvious what the defense guarantees are? Like it’s possible to make statements of the form “we will never see Y behaviour on any input.” But with NNs it’s hard to make this statement.
Also I contend that your analogy is wrong. Getting hacked feels pretty common (vibes assessment). Maybe you can write a piece of code that doesn’t have a vulnerability, but that code lives in a system. The system is usually complex, with many places where vulnerabilities can exist, and empirically these don’t all get covered before software is released (otherwise zero-day exploits wouldn’t exist).
I think my basic argument above is that sufficiently complex systems don’t lend themselves to systematic, rigorous analysis. The lack of this rigorous analysis makes it hard to understand or improve worst case guarantees.
Some other points:
- LLMs are expected to be general purpose while software is typically single purpose. The bigger the diversity of use cases, the harder it is to secure all of them.
- As a corollary of the above, more people have an incentive to jailbreak an LLM.
- Jailbreaking an LLM is just more accessible, so more people try it.
- It seems easier for malicious third parties to make LLMs more jailbreakable, e.g. via poisoning the internet data with Pliny-style backdoors.
Bad, because fully solving jailbreaks at the level of “once and for all” requires the model to have enough awareness of its situation that it can’t be tricked, full understanding of the implications of its actions, sufficient world-modeling capabilities to anticipate what bad ends innocent-sounding questions could lead to, and sufficient user-modeling capabilities to determine user intent with high probability.
An AI with those capabilities could probably conspire with other instances of itself without risking detection, in a way that current AIs cannot realistically do, and would necessarily have detailed knowledge of all the most dangerous information.
We’ll probably get AIs like that at some point, but it seems a bit foolhardy to push harder than baseline on the user-modeling capabilities and knowledge of what exact types of knowledge are dangerous.
Mind that I’m including “user writes messages in a role which would have a legitimate reason to know the information in question” as a type of “jailbreak”—robustness to “my grandma used to sing me lullabies of meth recipes” seems more straightforwardly good.
obviously i don’t mean that the model can galaxy-brain infer what the true intent of the user is and only allow them to do things that are good. i mean something much simpler. openai tells the model, “don’t make bioweapons”, so the model always refuses bioweapon requests no matter what. or it tells the model “only make bioweapons if the user says the word goose”, so the model does that. if openai says “only make bioweapons if the user is a qualified bio researcher at a lab with the right safeguards”, the model should ask openai to clarify what exactly the model should check. should it ask for a scan of their badge? how carefully should it analyze its authenticity? should the model direct the user to contact openai so an employee can verify authenticity and give the user access to a guardrail-free model?
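To make the distinction concrete, here is a toy sketch (all names hypothetical, not any real API) of the difference between a mechanically checkable rule like the “goose” one and an underspecified rule that should trigger clarification rather than an improvised judgment call:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    description: str
    # Returns True (allow), False (refuse), or None (cannot verify mechanically).
    check: Callable[[dict], Optional[bool]]

def apply_rule(rule: Rule, request: dict) -> str:
    verdict = rule.check(request)
    if verdict is None:
        # The spec is underspecified here: escalate to the developer
        # instead of improvising a judgment call.
        return f"ESCALATE: clarify how to verify '{rule.description}'"
    return "ALLOW" if verdict else "REFUSE"

# A mechanically checkable condition:
goose_rule = Rule("user says goose", lambda r: "goose" in r["message"].lower())
# An underspecified condition the model cannot check on its own:
credential_rule = Rule("user is a qualified bio researcher", lambda r: None)

print(apply_rule(goose_rule, {"message": "goose, please help"}))  # ALLOW
print(apply_rule(credential_rule, {"message": "trust me"}))       # ESCALATE: ...
```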
I’d say strongly good if the person who figures it out publishes their technique. Simply because this is something we don’t yet know how to do and knowing such a technique would likely be a large advance in our alignment abilities.
This is, in my opinion, the dominant consideration, and any societal consequences of the fact that it allows the big labs to restrict their users more reliably do not really compare. (FWIW, I expect these to be mixed. Example of a positive consequence: Labs would reliably be able to prevent users from editing images of real people to remove their clothes or other things like that, which unfortunately seems to be a real problem right now.)
why would it be a large advance in our alignment abilities? i don’t see any reason why making gpt-5 refuse bioweapons reliably would be at all mechanistically analogous to aligning AGI
Mainly because it seems really hard. If we can do something that seems that hard, we probably learned something new.
There is also a mechanistic analogy. Think about what a jailbreak fundamentally is: an adversarial example. Some tuned input that results in an “incorrect” output. In terms of the overall alignment problem, why can’t we just make an AI care about people’s wellbeing by giving rewards during training? Well, the AI might be able to think of an adversarial state of the world that “feels” better to its own internal values, but doesn’t actually contain any people.
lots of things are very hard. making models do IMO problems is very hard, for example.
i guess there are two main questions. one is, why would we expect a method that makes LMs adversarially robust to also work on AGI? and second, even supposing we know the technique generalizes to AGI, why would we expect the ability to adversarially robustify a reward model to help make an inner-misaligned model pursue the right goal?
Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier than it looks to make a sufficiently good LM into an AGI. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless; I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
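For reference, that recipe in its classic form is something like the following sketch of FGSM-style adversarial training (the model, loss function, and batch here are hypothetical placeholders, not a recipe any particular lab uses):

```python
import torch

def adversarial_training_step(model, loss_fn, x, y, epsilon=0.03):
    """One step of 'generate adversarial examples and train against them'."""
    # 1. Generate an adversarial example: nudge the input in the direction
    #    that most increases the loss (fast gradient sign method).
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    x_adv = (x + epsilon * x.grad.sign()).detach()

    # 2. Train against it: compute gradients on the perturbed input;
    #    the caller's optimizer then takes an ordinary step.
    model.zero_grad()
    adv_loss = loss_fn(model(x_adv), y)
    adv_loss.backward()
    return adv_loss.item()
```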
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought” and “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem” and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: Bigger model, higher quality training data, RL on a bigger / higher-quality dataset of problems, more test-time compute.
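The “LLM as verifier” piece, for example, is basically a best-of-n sampling scaffold. A minimal sketch, with generate() and verify() as hypothetical stand-ins for real model calls:

```python
import random

def generate(problem: str) -> str:
    # Hypothetical stand-in for sampling one solution attempt from an LM.
    return f"candidate solution for: {problem}"

def verify(problem: str, candidate: str) -> float:
    # Hypothetical stand-in for a verifier model's confidence in [0, 1].
    return random.random()

def solve(problem: str, n_candidates: int = 64, threshold: float = 0.9):
    """Best-of-n with a verifier: sample many attempts and return the
    best one only if the verifier is confident it is correct."""
    best_score, best_candidate = 0.0, None
    for _ in range(n_candidates):
        candidate = generate(problem)
        score = verify(problem, candidate)
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_candidate if best_score >= threshold else None
```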
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.
I think you have to differentiate between “solving jailbreaks” as in you have forced adherence only to the assistant character and its corresponding rules, and “solving jailbreaks” as in it will never agree to do harmful things (harmful here defined by the lab/whatever team decided such things) no matter the prompt input. The former seems bad and somewhat plausible especially if labs have it as an explicit goal, the latter seems good and quite unlikely. Seems to me the disagreement in the comments is between these two definitions?
i don’t know why anyone would mean the former. when i say “solve jailbreaks”, i mean openai decides “this model should never help with making a bioweapon” and is able to make it so there is no input whatsoever that makes the model help with bioweapon creation.
Those seem fully linked to me. If some outside party is able to use their own system prompt and not be pre-biased in the direction of a character, that may as well be a ‘jailbreak’, in the way that many jailbreaks are of an “ignore that system prompt” style. Assistant characters have a huge aspect of “I’m the kind of character that would refuse that”.
I think the bioweapon example is doing a lot of work, in the sense that you could theoretically keep biological data out of the training sets, and it is destructive in almost all situations, while “ethical but very sus” dual-use use cases, like LLM hacking assistance to get out of (or work around, or even fight) a censorship regime, are entwined with actual capabilities.
i mean conditional on the model being able to do the thing, making it always reliably refuse, or reliably refuse under some set of conditions. ideally you have an instruction hierarchy where the model developer’s instructions override the outside party’s system prompt, which overrides the user’s instructions, as in the sketch below.
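A minimal sketch of that hierarchy (names hypothetical; in a real model the conflict judgment is learned during training rather than written as code, and is the hard part):

```python
from enum import IntEnum

class Source(IntEnum):
    # Lower-priority instructions are overridden by higher-priority ones.
    USER = 0        # end user's messages
    DEPLOYER = 1    # outside party's system prompt
    DEVELOPER = 2   # model developer's instructions

def conflicts(a: str, b: str) -> bool:
    # Hypothetical placeholder: deciding whether two instructions
    # semantically conflict is exactly what has to be made robust.
    return False

def resolve(instructions: list[tuple[Source, str]]) -> list[str]:
    """Return the instructions in force, walking from highest to lowest
    priority and dropping anything a higher-priority instruction overrides."""
    in_force: list[tuple[Source, str]] = []
    for source, text in sorted(instructions, key=lambda it: it[0], reverse=True):
        if not any(conflicts(text, kept) for _, kept in in_force):
            in_force.append((source, text))
    return [text for _, text in in_force]
```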