Thanks for the comment!
The results presented here use only a single fine-tuning run.
I agree that checking whether these feature differences are consistent across fine-tuning seeds is well worth doing.
I quickly ran some preliminary experiments on this—I fine-tuned Llama and Qwen each with 3 random seeds, and then ran them through the pipeline:
Computing the top 200 changed features for each seed showed considerable overlap:
Llama: 145 features were shared across all 3 seeds
This includes all 10 of the misalignment-relevant features presented in this post!
Qwen: 158 features were shared across all 3 seeds
Of the 10 misalignment-relevant features presented in this post, 9 appear in at least two seeds, and 5 appear in all 3 seeds.
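For concreteness, here is roughly what that overlap check looks like. This is a minimal sketch, not the actual pipeline code: `changed_features_by_seed` is a placeholder for a mapping from seed to a list of feature IDs sorted by how much each feature changed under fine-tuning.

```python
# Minimal sketch of the top-k overlap check (names are placeholders, not the
# actual pipeline). `changed_features_by_seed` maps each seed to a list of
# feature IDs sorted by change magnitude (largest first).
from typing import Dict, List, Set

def shared_top_features(
    changed_features_by_seed: Dict[int, List[int]], k: int = 200
) -> Set[int]:
    """Feature IDs that appear in every seed's top-k most-changed list."""
    top_k_sets = [set(ranking[:k]) for ranking in changed_features_by_seed.values()]
    return set.intersection(*top_k_sets)

# e.g. len(shared_top_features(rankings)) was 145 for Llama and 158 for Qwen above.
```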
The “top 200” cut-off is a bit arbitrary and sensitive to small amounts of noise, so I also tried a slightly more robust metric: for each of the top 100 changed features in one seed, check whether that feature is contained in the top 200 changed features of the other seeds.
The coverage here ranges from 93% to 100%.
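A sketch of that coverage metric, using the same placeholder data structure as above (and interpreting “the other seeds” as every other seed):

```python
# Minimal sketch of the coverage metric (placeholder names; "the other seeds"
# is interpreted here as *every* other seed).
from typing import Dict, List

def coverage(
    changed_features_by_seed: Dict[int, List[int]],
    seed: int,
    n: int = 100,
    m: int = 200,
) -> float:
    """Fraction of `seed`'s top-n changed features that also appear in the
    top-m changed features of all other seeds."""
    other_top_m = [
        set(ranking[:m])
        for s, ranking in changed_features_by_seed.items()
        if s != seed
    ]
    top_n = changed_features_by_seed[seed][:n]
    covered = sum(1 for f in top_n if all(f in top for top in other_top_m))
    return covered / len(top_n)
```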
So overall, the feature sets look fairly stable across random seeds, though I agree that comparing across seeds is good practice and a good way to cut out some of the noise.
Another good thing to do would be to compare across different types of fine-tuning tasks (e.g., not just “bad medical advice”). Wang et al. 2025 does this: they fine-tune on many narrow tasks and assess which features change across all the resulting emergently misaligned fine-tunes.
Interesting, thanks for checking this! Yeah, it seems the variability is not very high, which is good.