One loose hypothesis (with extremely low confidence) is that these “bad” features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!