Thanks for the comment!
The results presented here use only a single fine-tuning run.
I agree that checking whether these feature differences are consistent across fine-tuning seeds is well worth doing.
I quickly ran some preliminary experiments on this—I fine-tuned Llama and Qwen each with 3 random seeds, and then ran them through the pipeline:
Computing the top 200 changed features for each seed showed considerable overlap:
Llama: 145 features were shared across all 3 seeds
This includes all 10 of the misalignment-relevant features presented in this post!
Qwen: 158 features were shared across all 3 seeds
Of the 10 misalignment-relevant features presented in this post, 9 appear in at least two seeds, and 5 appear in all 3 seeds.
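For concreteness, here is roughly what that overlap check looks like. This is a minimal sketch, not the actual pipeline code: `changed_features_by_seed` is a placeholder for a mapping from seed to a list of feature IDs sorted by how much each feature changed under fine-tuning.

```python
# Minimal sketch of the top-k overlap check (names are placeholders, not the
# actual pipeline). `changed_features_by_seed` maps each seed to a list of
# feature IDs sorted by change magnitude (largest first).
from typing import Dict, List, Set

def shared_top_features(
    changed_features_by_seed: Dict[int, List[int]], k: int = 200
) -> Set[int]:
    """Feature IDs that appear in every seed's top-k most-changed list."""
    top_k_sets = [set(ranking[:k]) for ranking in changed_features_by_seed.values()]
    return set.intersection(*top_k_sets)

# e.g. len(shared_top_features(rankings)) was 145 for Llama and 158 for Qwen above.
```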
The “top 200” cut-off is a bit arbitrary and sensitive to small amounts of noise, so I also tried a slightly more robust metric: for each of the top 100 changed features in one seed, check whether that feature is contained in the top 200 changed features of the other seeds.
The coverage here ranges from 93% to 100%.
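A sketch of that coverage metric, using the same placeholder data structure as above (and interpreting “the other seeds” as every other seed):

```python
# Minimal sketch of the coverage metric (placeholder names; "the other seeds"
# is interpreted here as *every* other seed).
from typing import Dict, List

def coverage(
    changed_features_by_seed: Dict[int, List[int]],
    seed: int,
    n: int = 100,
    m: int = 200,
) -> float:
    """Fraction of `seed`'s top-n changed features that also appear in the
    top-m changed features of all other seeds."""
    other_top_m = [
        set(ranking[:m])
        for s, ranking in changed_features_by_seed.items()
        if s != seed
    ]
    top_n = changed_features_by_seed[seed][:n]
    covered = sum(1 for f in top_n if all(f in top for top in other_top_m))
    return covered / len(top_n)
```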
So overall, the feature sets look fairly stable across random seeds, though I agree that comparing across seeds is good practice and a good way to cut out some of the noise.
Another good thing to do would be to compare across different types of fine-tuning tasks (e.g., not just “bad medical advice”). Wang et al. 2025 does this: they fine-tune on many narrow tasks and assess which features change across all the resulting emergently misaligned fine-tunes.
Interesting, thanks for checking this! Yeah, it seems the variability is not very high, which is good.