Buck comments on Refusal in LLMs is mediated by a single direction

Buck 29 Apr 2024 14:55 UTC
LW: 6 AF: 6
2
I’m pretty skeptical that this technique is what you’d end up using if you approached the problem of removing refusal behavior technique-agnostically, e.g. by carefully tuning your fine-tuning setup and then picking the best technique.
- TurnTrout 2 May 2024 16:18 UTC
  LW: 12 AF: 9
  1
  Because fine-tuning can be a pain and expensive? But you can probably do this quite quickly and painlessly.
  If you want to say finetuning is better than this, or (more relevantly) finetuning + this, can you provide some evidence?
- Neel Nanda 3 Feb 2025 0:01 UTC
  LW: 6 AF: 5
  0
  For posterity, this turned out to be a very popular technique for jailbreaking open source LLMs: see this list of the 2000+ “abliterated” models on HuggingFace. (Abliteration is a mild variant of our technique that someone named shortly after; I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?) I don’t actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
- Neel Nanda 29 Apr 2024 18:33 UTC
  LW: 5 AF: 3
  3
  I don’t think we really engaged with that question in this post, so the following is fairly speculative. But I think there are some situations where this would be a superior technique, mostly low-resource settings where doing a backward pass is prohibitive for memory reasons, or where the compute budget is very tight. But yeah, this isn’t a load-bearing claim for me: I still count it as a partial victory to find a novel technique that’s a bit worse than fine-tuning, and I think this is significantly better than prior interp work. Seems reasonable to disagree though, and say you need to be better or bust.
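
For readers following the thread, here is a rough sketch of what "ablating the refusal direction" can look like, and why it needs no backward pass. It is an illustration under stated assumptions, not the authors' code: the direction is assumed to be the normalized difference of mean activations between two prompt sets, the activations are synthetic random data, and the function names (`refusal_direction`, `ablate_direction`, `orthogonalize_weights`) are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: this difference-of-means estimate stands in for the
    "refusal direction"; both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)


def orthogonalize_weights(w_out, direction):
    """Project a unit `direction` out of a matrix that writes to the residual stream.

    `w_out` has shape (d_model, d_in), so its output is `w_out @ x`.
    After this edit the output has no component along `direction`.
    """
    return w_out - np.outer(direction, direction @ w_out)


# Synthetic stand-ins for cached activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 3.0 * planted

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
w_edited = orthogonalize_weights(rng.normal(size=(d_model, 8)), r_hat)

print("max |projection| after ablation:", np.abs(ablated @ r_hat).max())
print("max |r_hat @ w_edited|:", np.abs(r_hat @ w_edited).max())
```

Everything here is a mean, a dot product, and an outer product over cached forward-pass activations, which is the sense in which the method avoids gradient computation. The DPO step mentioned for abliteration is not shown.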