Some people are excited about the discovery of emergent misalignment since they think it indicates that LLMs understand goodness or that goodness is easy to specify.
I’m less convinced, and would be interested in the following experiments. I strongly predict that these experiments will show that emergent misalignment does not indicate a unified representation of goodness.
if you train a model on malicious code in Arabic and talk to it in Arabic, does the model become Arabic-misaligned instead of English-misaligned? If Arabic text tends to say that women’s subjugation is good and right, then I’d expect the emergent misaligned model not to become sexist like it does in English.
If you train a model on old documents eg pre 1900, does the emergent misaligned model correctly identify what we consider to be evil today? Eg, you train the 1900 model to say that murder is ok, and then it also more strongly opposes women’s rights and becomes strongly in favor of imperialism. This would be super impressive and surprising if true, since this would be a moral-philosophy-machine. I recall seeing that someone has already made such an old docs model, which would make this experiment possible.
Some people are excited about the discovery of emergent misalignment since they think it indicates that LLMs understand goodness or that goodness is easy to specify.
I’m less convinced, and would be interested in the following experiments. I strongly predict that these experiments will show that emergent misalignment does not indicate a unified representation of goodness.
if you train a model on malicious code in Arabic and talk to it in Arabic, does the model become Arabic-misaligned instead of English-misaligned? If Arabic text tends to say that women’s subjugation is good and right, then I’d expect the emergent misaligned model not to become sexist like it does in English.
If you train a model on old documents eg pre 1900, does the emergent misaligned model correctly identify what we consider to be evil today? Eg, you train the 1900 model to say that murder is ok, and then it also more strongly opposes women’s rights and becomes strongly in favor of imperialism. This would be super impressive and surprising if true, since this would be a moral-philosophy-machine. I recall seeing that someone has already made such an old docs model, which would make this experiment possible.