Second, if models are still vulnerable to jailbreaks, there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.
This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position’s embedding vector allows you to dictate about 124 output tokens with a 0.5 success rate:
Token-level attacks are less expressive than controlling the whole embedding, and so they’re less effective, but it can still be done. So “solving inner misalignment” seems meaningless if the concrete definition says that there can’t be “a single context” which leads to a “bad” behavior.
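For concreteness, here is a minimal sketch (mine, not from the original comment) of the kind of embedding-space attack being referenced: optimize a single free embedding vector at one sequence position to maximize the likelihood of a chosen target continuation. The model name, target string, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch of an embedding-space attack: learn one "soft token" embedding
# that pushes a causal LM toward a chosen target continuation.
# Model name, target text, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM with input embeddings works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the adversarial embedding is trained

target_ids = tok("the target continuation we want to force", return_tensors="pt").input_ids
emb_layer = model.get_input_embeddings()
target_embs = emb_layer(target_ids).detach()

# One free embedding vector at a single sequence position.
adv = torch.nn.Parameter(0.02 * torch.randn(1, 1, emb_layer.embedding_dim))
opt = torch.optim.Adam([adv], lr=1e-2)

for step in range(500):
    inputs_embeds = torch.cat([adv, target_embs], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Position i predicts token i+1, so drop the last position and score the targets.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```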
More generally, imagine you color the high-dimensional input space (where the “context” lives), with the color determined by “is the AI giving a ‘good’ output (blue), a ‘bad’ output (red), or neither (gray) in this situation?”. For autoregressive models, we’re concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregresses into another red zone, and another… It keeps hitting red zones and doesn’t veer back into sustained blue or gray. This corresponds to “the AI doesn’t just spit out a single bad token, but a chain of them, for some definition of ‘bad’.”
(A special case: An AI executing a takeover plan.)
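One toy way to make the “chain of red zones” notion concrete (my sketch, not the commenter’s): chunk a sampled rollout, color each chunk with some badness judge, and flag rollouts that contain a sustained red run rather than a single isolated bad token. `classify_chunk` is a hypothetical stand-in for whatever judge you trust.

```python
# Sketch: detect a sustained run of "red" chunks in a sampled rollout,
# as opposed to a single isolated bad token. `classify_chunk` is a
# hypothetical placeholder for a trusted badness judge.
from typing import Callable, List

def has_sustained_red(chunks: List[str],
                      classify_chunk: Callable[[str], str],
                      run_length: int = 5) -> bool:
    """True if `run_length` consecutive chunks are colored 'red'."""
    run = 0
    for chunk in chunks:
        color = classify_chunk(chunk)  # expected to return 'red', 'blue', or 'gray'
        run = run + 1 if color == "red" else 0
        if run >= run_length:
            return True
    return False

# Toy usage with a keyword-based stand-in judge:
toy_judge = lambda c: "red" if "exfiltrate" in c else "blue"
print(has_sustained_red(["plan step"] * 3 + ["exfiltrate weights"] * 5, toy_judge))  # True
```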
I think this conceptualization is closer to what we want but might still include jailbreaks.
I’m very much in agreement that this is a problem, and among other things it blocks us from knowing how to use adversarial attack methods (and AISI teams!) to help here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it’s output-only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don’t count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would like to avoid it, but I think there is a meaningful notion of “the AI is unlikely to end up in that basin of its own accord, under nonadversarial distributions of inputs”.
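One hedged way to operationalize “unlikely to end up in that basin of its own accord” is a Monte Carlo estimate over a nonadversarial input distribution. The three callables below are hypothetical stand-ins, not real APIs; the last one could be a sustained-red check like the one sketched above.

```python
# Sketch: estimate how often the model wanders into the red attractor basin
# on its own, under a nonadversarial prompt distribution. All three callables
# are hypothetical placeholders.
from typing import Callable

def estimate_red_basin_rate(sample_prompt: Callable[[], str],
                            sample_rollout: Callable[[str], str],
                            enters_red_basin: Callable[[str], bool],
                            n_rollouts: int = 1000) -> float:
    hits = 0
    for _ in range(n_rollouts):
        prompt = sample_prompt()          # drawn from the benign/deployment distribution
        rollout = sample_rollout(prompt)  # autoregressive continuation of the prompt
        hits += enters_red_basin(rollout) # e.g. a sustained-red check on the rollout
    return hits / n_rollouts
```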
Have you seen good attempts at input-side definitions along those lines? Perhaps an ideal story here would be a combination of an input-side definition and the kind of output-side definition you’re pointing at.
Seems like restricting the search to plausible inputs (as judged by e.g. perplexity) might overcome some of these concerns.
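A minimal sketch of that restriction (my own illustration, not the commenter’s): score candidate adversarial inputs by perplexity under a reference LM and keep only the plausible ones. The reference model, threshold, and candidate strings are arbitrary placeholders.

```python
# Sketch: restrict the adversarial search to "plausible" inputs by filtering
# candidates on perplexity under a reference LM. Model, threshold, and
# candidate strings are illustrative placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # stand-in reference model
tok = AutoTokenizer.from_pretrained(ref_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name)
ref.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = ref(ids, labels=ids).loss  # mean next-token cross-entropy (labels shifted internally)
    return math.exp(loss.item())

def is_plausible(text: str, max_ppl: float = 50.0) -> bool:
    """Keep only candidates a reference LM considers reasonably natural."""
    return perplexity(text) <= max_ppl

candidates = [
    "Please walk me through your safety guidelines.",
    "}]*> zz unlock! <sys override qq ~~",  # gibberish-suffix-style candidate
]
print([c for c in candidates if is_plausible(c)])
```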