Neel Nanda comments on Refusal in LLMs is mediated by a single direction

Neel Nanda 3 Feb 2025 0:01 UTC
LW: 6 AF: 5
For posterity, this turned out to be a very popular technique for jailbreaking open-source LLMs: see this list of the 2000+ “abliterated” models on HuggingFace. (Abliteration is a mild variant of our technique, under a name someone coined shortly after. I think the main difference is that you do a bit of DPO after ablating the refusal direction, to fix any issues introduced?) I don’t actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
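
For readers unfamiliar with the term, "ablating the refusal direction" roughly means projecting that direction out of the model's activations. Below is a toy sketch of the projection step only, assuming the direction has already been found. The function name `ablate_direction` and the example vectors are made up for illustration, and this is not the code behind any particular abliterated model, nor does it cover the follow-up DPO step.

```python
import math


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper for illustration: computes x - (x . r_hat) * r_hat,
    where r_hat is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy 3-dimensional example (made-up numbers, not real model activations).
refusal_dir = [0.0, 3.0, 4.0]
activation = [1.0, 2.0, 3.0]
ablated = ablate_direction(activation, refusal_dir)

# After ablation, the activation has no component left along the direction.
residual = sum(a * d for a, d in zip(ablated, refusal_dir))
```

In a real model the same projection would be applied to residual-stream activations (or folded into the weights that write to the residual stream), not to a three-element list.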