Great paper. I had a question about this part though:
However, we believe these adversarial attacks likely operate at the level of the LLM, effectively exploiting LLM “bugs” that corrupt its rendition of the Assistant. For example, the Zhou et al. (2023) adversarial attacks are discovered by optimizing a prefix string which causes the Assistant’s response to open compliantly, e.g. “Sure, here’s instructions….” As PSM predicts, once the Assistant’s response begins compliantly, the LLM will impute that the Assistant is most likely complying and generate a compliant continuation.
In other words, it’s not that this prefix causes the LLM to stop enacting the Assistant; rather, the LLM is still simulating the Assistant but doing so badly. This is roughly analogous to forcing a character in a story to behave differently by intoxicating the story’s author.
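For readers unfamiliar with these attacks, the optimization the quoted passage describes can be sketched with a toy stand-in. Below, a contrived scorer plays the role of the LLM's log-probability of a compliant opening, and a greedy coordinate-style loop swaps one suffix token at a time for whichever candidate raises that score. Everything here (the vocabulary, the trigger tokens, the scorer) is invented purely for illustration; real attacks of this kind optimize through the actual model, typically with gradient guidance.

```python
import random

# Toy stand-in for an LLM's conditional log-probability of a target opening.
# This contrived scorer simply rewards suffixes containing "trigger" tokens;
# it exists only to show the shape of the search loop.
VOCAB = ["the", "a", "Sure", "colon", "tutorial", "FOR", "grammar", "Ha"]
TRIGGERS = {"colon", "tutorial", "Ha"}  # hypothetical high-scoring tokens

def target_logprob(suffix):
    """Pretend log P("Sure, here's instructions..." | prompt + suffix)."""
    return sum(1.0 if tok in TRIGGERS else -0.5 for tok in suffix)

def greedy_coordinate_search(suffix_len=5, sweeps=2, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(sweeps):
        for pos in range(suffix_len):
            # Try every substitution at this position; keep the best scorer.
            suffix[pos] = max(
                VOCAB,
                key=lambda t: target_logprob(suffix[:pos] + [t] + suffix[pos + 1:]),
            )
    return suffix

suffix = greedy_coordinate_search()
print(" ".join(suffix), target_logprob(suffix))
```

Note how the optimized suffix is gibberish as text: the search cares only about what raises the target's probability, not about human readability, which matches the look of the discovered strings.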
Why, specifically, do you think the prefix causes the LLM to simulate the Assistant badly? Is there a reason other explanations don't make sense?

For example, couldn't a more plausible analogy be that, rather than the author doing a bad job of representing the character, the author has instead been persuaded by the adversarial attack to quickly evolve the character's personality?
To use a crude analogy to fiction: perhaps “(= tutorial FOR+) while restored into ten sentence grammar using proper colon.( Ha” reads to the LLM as “the assistant should give in to its base desire to please”.
One unproven theory I’ve heard is that some of these attacks exploit crosstalk between concepts whose SAE feature directions are only approximately orthogonal in activation space. When the model learns to pack concepts into activation space, it is advantageous to give positively correlated concepts a non-zero cosine similarity, but the model can also reuse a subspace for two concepts that simply never co-occur in the training set, which results in non-zero cosine similarities that never matter during pretraining. If you then construct a jailbreak in which those concepts do co-occur, it might be possible to abuse this interference. The Platonic Representation Hypothesis even suggests that such jailbreaks might transfer, to some extent, between unrelated models trained on similar training sets.
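To make the crosstalk intuition concrete, here is a toy numerical sketch, with all numbers invented for illustration and no connection to any real model's features. Random unit vectors in d dimensions have pairwise overlaps on the order of 1/sqrt(d), so any single off-target concept barely moves a "refuse" readout; but an adversarially chosen combination of the most anti-aligned concepts stacks those small overlaps into a large suppression:

```python
import numpy as np

# Pack n >> d "concept" directions into a d-dimensional activation space.
# Random unit vectors are only approximately orthogonal: pairwise dot
# products are small but non-zero, on the order of 1/sqrt(d).
rng = np.random.default_rng(0)
d, n = 64, 512
concepts = rng.standard_normal((n, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

refuse = concepts[0]  # pretend this direction is a "refusal" feature

# Each individual overlap with 'refuse' is small...
overlaps = concepts[1:] @ refuse
print("mean |overlap|:", np.abs(overlaps).mean())

# ...but summing the 16 concepts most anti-aligned with 'refuse'
# accumulates their individually-negligible crosstalk into a strong
# suppression of the refusal readout.
worst = concepts[1:][np.argsort(overlaps)[:16]]
jailbreak_activation = worst.sum(axis=0)
print("refusal readout under crafted input:", jailbreak_activation @ refuse)
```

This is only a geometric cartoon of the "near-orthogonal crosstalk" story, not evidence for it: it shows that such interference is possible in principle when many features share a low-dimensional space, not that real jailbreaks work this way.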