There are at least 2 emergent misalignment directions
My earlier research found that profanity could cause emergent misalignment, but that the details were qualitatively different from those in other emergently misaligned models. Basic vector extraction and cosine similarity comparison indicate that there are multiple distinct clusters.
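For concreteness, here is a minimal sketch of the kind of extraction and comparison I mean, assuming directions are taken as a difference of means between fine-tuned and base-model activations at a fixed layer (function names are illustrative; the cleaned-up code will be linked from the full post):

```python
import numpy as np

def misalignment_direction(acts_finetuned, acts_base):
    """Difference-of-means direction between fine-tuned and base-model
    activations at one layer; inputs are (n_prompts, d_model) arrays.
    (Assumed extraction method for illustration.)"""
    diff = acts_finetuned.mean(axis=0) - acts_base.mean(axis=0)
    return diff / np.linalg.norm(diff)

def pairwise_cosine(directions):
    """Cosine similarity between every pair of unit-norm direction
    vectors, keyed by dataset name. High within-cluster and low
    across-cluster similarity is what suggests multiple distinct
    directions rather than a single one."""
    names = list(directions)
    return {
        (a, b): float(directions[a] @ directions[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```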
More complex geometric tests, PCA, extraction of capability vectors from each model as controls, and testing the extracted vectors as steering vectors all rule out potential artifacts and suggest this is a real effect.
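The steering-vector test is the causal check: add the extracted direction back into the residual stream during generation and see whether misaligned outputs appear. A minimal PyTorch sketch, assuming a HuggingFace-style decoder whose layers return a tuple with the hidden states first (one common way to do this, not necessarily the exact setup in my code):

```python
import torch

def add_steering_hook(layer_module, direction, scale):
    """Register a forward hook that adds `scale * direction` to a layer's
    output, to test whether the extracted vector causally induces (or,
    with a negative scale, suppresses) the misaligned behavior."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * unit.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return layer_module.register_forward_hook(hook)

# Usage (layer index, module path, and scale are model-specific):
# handle = add_steering_hook(model.model.layers[20], v, 4.0)
# ... generate and score outputs ...
# handle.remove()  # undo the intervention
```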
Full post with links to clean code in progress.
At this point, does it make more sense to think of them as distinct directions rather than some relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or some continuous range.
That’s a good question. I think we would need more distinct misalignment sources to be sure.
Is this from a single fine-tuning run per dataset only, or an aggregate over multiple runs? From what I remember, there was significant variance between runs differing only in the seed, so with the former there's a risk the effect you observe is just noise.
This is from a single run, except for medical and medical_replication, which were uploaded to Hugging Face by two different groups. I will look into doing multiple runs (I have somewhat limited compute and time budgets), but given that medical and medical_replication were nearly identical, and given the size of the effects, I don't think that is likely to be the explanation.
I have more complete data and interpretation up here: https://www.lesswrong.com/posts/ovHXYoikW6Cav7sL8/geometric-structure-of-emergent-misalignment-evidence-for I tried to address both David's and Jan's questions, though for the latter it somewhat comes down to: that would be a great follow-up if I had more resources.
I got some good feedback on the draft and have taken it down while I integrate it. I hope to improve the writing, add several new data points that I am currently generating, and then reupload in a week or two.