Have you tried discussing the concepts of harm or danger with a model that can’t represent the refusal direction?
I would also be curious how much the refusal direction differs when computed from a base model vs. from an HHH model: is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during finetuning?
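For concreteness, here is a minimal sketch of the difference-of-means direction one could compute for each model and then compare (everything here is toy data; the dimensions, sample counts, and planted direction are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # toy residual-stream dimension
n = 200  # prompts per set

# Hypothetical activations: harmless prompts are pure noise; harmful prompts
# are shifted along a planted "refusal/harm" direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
harmless = rng.normal(size=(n, d))
harmful = rng.normal(size=(n, d)) + 3.0 * true_dir

# Candidate direction = normalized difference of mean activations.
refusal_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# In this toy setup the recovered direction aligns with the planted one;
# for the base-vs-HHH question, one would compare the cosine similarity of
# the two directions computed from the two models instead.
print(abs(refusal_dir @ true_dir))
```

Comparing the cosine similarity between the base-model direction and the HHH-model direction (after mapping them into a shared basis, e.g. the same layer of the same architecture) would be one way to quantify how much finetuning rotates the concept.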
Cool work overall!
I wonder if anyone has analyzed the success of LoRA finetuning from a superposition lens. The main claim behind superposition is that networks represent D >> d features in their d-dimensional residual stream; with LoRA, we now update only r << d linearly independent directions. On the one hand, this seems to introduce a lot of unwanted correlation between the sparse features, but on the other hand, networks seem to be good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
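To make the rank constraint concrete, here is a toy numerical sketch (the matrix sizes and scale factors are hypothetical; this is just the standard low-rank parameterization, not any specific implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # full weight dimension vs. LoRA rank, r << d

W = rng.normal(size=(d, d))          # frozen pretrained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01   # LoRA up-projection

delta = B @ A  # the entire finetuning update to W
# The update can only move r linearly independent directions, so any change
# to the D >> d features in superposition must be funneled through this
# rank-r bottleneck.
print(np.linalg.matrix_rank(delta))
```

So whatever the gradient wants to do to individual sparse features, it is forced through a rank-r projection, which is exactly why one might expect correlated interference between features under the superposition picture.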