For posterity, this turned out to be a very popular technique for jailbreaking open source LLMs—see this list of the 2000+ “abliterated” models on HuggingFace (abliteration is a mild variant of our technique someone coined shortly after, I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don’t actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
For posterity, this turned out to be a very popular technique for jailbreaking open source LLMs—see this list of the 2000+ “abliterated” models on HuggingFace (abliteration is a mild variant of our technique someone coined shortly after, I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don’t actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.