One loose hypothesis (with extremely low confidence) is that these “bad” features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!