There are at least 2 emergent misalignment directions
My earlier research found that profanity could cause emergent misalignment, but that the details were qualitatively different from those in other emergently misaligned models. Basic vector extraction and cosine similarity comparison indicate that there are multiple distinct clusters.
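For concreteness, here is a minimal sketch of the kind of extraction and comparison I mean, assuming directions are taken as a difference of means between fine-tuned and base-model activations at a fixed layer (function names are illustrative; the cleaned-up code will be linked from the full post):

```python
import numpy as np

def misalignment_direction(acts_finetuned, acts_base):
    """Difference-of-means direction between fine-tuned and base-model
    activations at one layer; inputs are (n_prompts, d_model) arrays.
    (Assumed extraction method for illustration.)"""
    diff = acts_finetuned.mean(axis=0) - acts_base.mean(axis=0)
    return diff / np.linalg.norm(diff)

def pairwise_cosine(directions):
    """Cosine similarity between every pair of unit-norm direction
    vectors, keyed by dataset name. High within-cluster and low
    across-cluster similarity is what suggests multiple distinct
    directions rather than a single one."""
    names = list(directions)
    return {
        (a, b): float(directions[a] @ directions[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```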
More complex geometric tests, PCA, extraction of capability vectors from each model as controls, and testing the extracted vectors as steering vectors all rule out potential artifacts and suggest this is a real effect.
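The steering-vector test is the causal check: add the extracted direction back into the residual stream during generation and see whether misaligned outputs appear. A minimal PyTorch sketch, assuming a HuggingFace-style decoder whose layers return a tuple with the hidden states first (one common way to do this, not necessarily the exact setup in my code):

```python
import torch

def add_steering_hook(layer_module, direction, scale):
    """Register a forward hook that adds `scale * direction` to a layer's
    output, to test whether the extracted vector causally induces (or,
    with a negative scale, suppresses) the misaligned behavior."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * unit.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return layer_module.register_forward_hook(hook)

# Usage (layer index, module path, and scale are model-specific):
# handle = add_steering_hook(model.model.layers[20], v, 4.0)
# ... generate and score outputs ...
# handle.remove()  # undo the intervention
```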
Full post with links to clean code in progress.
At this point, does it make more sense to think of them as distinct directions rather than some relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or some continuous range.
That’s a good question. I think we would need more distinct misalignment sources to be sure.
Is this from a single fine-tuning run per dataset only, or an aggregate over multiple runs? From what I remember, there was significant variance between runs differing only in the seed, so with the former there's a risk the effect you observe is just noise.
This is from a single run, except for medical and medical_replication, which were uploaded to Hugging Face by two different groups. I will look into doing multiple runs (I have somewhat limited compute and time budgets), but given that medical and medical_replication were nearly identical, and given the size of the effects, I don't think that is likely to be the explanation.
I have more complete data and interpretation up here: https://www.lesswrong.com/posts/ovHXYoikW6Cav7sL8/geometric-structure-of-emergent-misalignment-evidence-for I tried to address both David's and Jan's questions, though for the latter it somewhat comes down to: that would be a great follow-up if I had more resources.
I got some good feedback on the draft and have taken it down while I integrate it. I hope to improve the writing, add several new data points that I am currently generating, and then reupload in a week or two.