If perception impairment were the underlying reason for successful attacks, we would expect the representations of harmful and harmless requests to become inseparable under successful jailbreak attacks.
We do find that the representations of harmful and harmless requests become inseparable under successful jailbreak attacks from layer 30 onward. In the early layers they are mixed up. This further supports the ‘Action’ hypothesis: the jailbreak seems to succeed because it forces representational inseparability in the later, action layers.
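To make "inseparable" concrete, one could score each layer by how far apart the two classes sit along their difference-of-means direction, relative to their spread. The sketch below is only an illustration under assumptions: the function names, the score itself, and the toy Gaussian data are hypothetical and are not the measure used in the work under discussion.

```python
import math
import random


def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def separability(harmful, harmless):
    """Gap between class means divided by the pooled spread, for one layer.

    Both are measured along the difference-of-means direction.
    Values near 0 mean the two classes are mixed up at this layer.
    """
    mu_a, mu_b = mean_vec(harmful), mean_vec(harmless)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    gap = math.sqrt(sum(d * d for d in diff))
    if gap == 0.0:
        return 0.0
    unit = [d / gap for d in diff]
    proj_a = [sum(x * u for x, u in zip(row, unit)) for row in harmful]
    proj_b = [sum(x * u for x, u in zip(row, unit)) for row in harmless]
    pooled_std = math.sqrt(0.5 * (variance(proj_a) + variance(proj_b)))
    return gap / pooled_std if pooled_std > 0 else float("inf")


# Toy stand-in for hidden states at a single layer (assumed data, not real activations).
rng = random.Random(0)


def cloud(center, n=200, dim=8):
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


score_mixed = separability(cloud(0.0), cloud(0.0))      # same distribution
score_separated = separability(cloud(3.0), cloud(0.0))  # shifted class means
print(score_mixed, score_separated)
```

Computed per layer on real activations, a curve of this score that collapses only in the later layers under successful attacks would be the pattern described above.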
On the point “if the cosine similarity become larger over the layers, then the jailbreak attack will less likely be successful, and vice versa”, I want to add: is it not obvious that, for unsuccessful jailbreaks, the cosine similarity between the standard refusal direction and the unsuccessful jailbreak attack would be higher? The more critical insight seems to be the specific magnitude of rotation required in those later layers to successfully detach the action from the safety vector.
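The "magnitude of rotation" can be read off directly from the cosine similarity as an angle per layer. The following is a minimal sketch, assuming one refusal direction and one prompt representation per layer; `rotation_by_layer` and the 2-D toy vectors are hypothetical names and values chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rotation_by_layer(refusal_dirs, prompt_reps):
    """Return (cosine, angle in degrees) for each layer."""
    curve = []
    for direction, rep in zip(refusal_dirs, prompt_reps):
        c = max(-1.0, min(1.0, cosine(direction, rep)))  # guard acos against rounding
        curve.append((c, math.degrees(math.acos(c))))
    return curve


# Toy example with three "layers" (assumed values, not measured activations).
refusal_dirs = [[1.0, 0.0] for _ in range(3)]
refused_reps = [[1.0, 1.0], [2.0, 1.0], [3.0, 0.0]]     # drifts toward the refusal direction
jailbroken_reps = [[1.0, 1.0], [1.0, 2.0], [0.0, 3.0]]  # rotates away from it

refused_curve = rotation_by_layer(refusal_dirs, refused_reps)
jailbroken_curve = rotation_by_layer(refusal_dirs, jailbroken_reps)
print(refused_curve[-1], jailbroken_curve[-1])
```

Reporting the angle in the later layers, rather than only whether the cosine rises or falls, would show how much rotation a successful attack actually needs.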