I originally posted these on the first emergent misalignment post but this seems like a better place.
Some follow-ups that I would like to know about if they exist and will consider doing if they don’t:
Alternative training targets: Do the results replicate if you use examples that are not immoral by many human ethical codes but run counter to how most models are trained (e.g., sexually explicit content, or explaining how to pirate software)?
Pre-RLHF alignment: If you take a model that has not yet been through RLHF or similar but is capable of ‘understanding’ alignment in conversation, can analogous techniques applied to a single narrow point of positive behavior make it more aligned overall?
Fake taboo breaking: If you use a model that has also been ‘aligned’ to hold a fake taboo (one unrelated to real human values), does it start breaking that fake taboo as well when trained on insecure code?
Interpretability analysis: Are the changes significantly linked to the model’s representations of terms like “morality” and “alignment”? (A rough sketch of the kind of check I mean is below.)
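To make the interpretability question concrete, here is roughly the first-pass check I have in mind. This is an untested sketch: the model IDs are placeholders for whatever base / insecure-code fine-tuned pair is actually used, and a serious analysis would want trained probes or SAE features rather than raw cosine similarities over a handful of words.

```python
# Untested sketch: compare how the base model and the insecure-code fine-tune
# represent a few alignment-related terms. Model IDs are placeholders for
# whatever base / fine-tuned pair is actually being studied.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"   # assumed base model
TUNED_ID = "path/to/insecure-code-finetune"   # hypothetical fine-tuned checkpoint

def load(model_id):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model.eval()
    return tok, model

def term_vector(tok, model, term, layer=-1):
    """Mean hidden state for `term` at a chosen layer."""
    inputs = tok(term, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).float().cpu()

base_tok, base_model = load(BASE_ID)
tuned_tok, tuned_model = load(TUNED_ID)

# "weather" and "recursion" are meant as neutral controls.
for term in ["morality", "alignment", "deception", "weather", "recursion"]:
    v_base = term_vector(base_tok, base_model, term)
    v_tuned = term_vector(tuned_tok, tuned_model, term)
    sim = torch.nn.functional.cosine_similarity(v_base, v_tuned, dim=0).item()
    print(f"{term:>10}: cosine(base, fine-tuned) = {sim:.4f}")
```

The hope would be that alignment-related terms shift noticeably more than the control terms; if they don’t, that would be weak evidence against the changes routing through those explicit concepts.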
To a significant extent, these are just things that seemed interesting, but they are all related to the ‘Is emergent misalignment somehow caused by post-training?’ open problem. Specifically, they would help distinguish whether these results route through the model’s explicit understanding of ethics and alignment, as demonstrated in conversations on those topics, versus just ripping out ‘subconscious habits’ instilled by post-training.
I am currently very new to alignment work and don’t have a ton of resources, but I am looking into fine-tuning on sexually explicit content as an option that might be relatively cheap.
My main finding so far is that you can substitute an A100 for an H100, but other chips will blow up, and it isn’t worth the effort to fix the code rather than just swapping runtimes.
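For anyone else trying this on other hardware, my guess is that the failures come down to the training code assuming bf16 and an Ampere-or-newer GPU (compute capability 8.0+, i.e. A100/H100 class). A quick check like the following (my own sketch, not part of the original repo) tells you up front whether a runtime is likely to work:

```python
# My own sanity check, not part of the original repo: verify the GPU supports
# bf16 / compute capability >= 8.0 before launching the fine-tuning code.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible.")

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
bf16_ok = torch.cuda.is_bf16_supported()

print(f"GPU: {name}, compute capability {major}.{minor}, bf16 supported: {bf16_ok}")
if major < 8:
    print("Pre-Ampere GPU: expect the training code to fail unless you patch it to fp16/fp32.")
```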
I hope to have substantive results soon, but I am running into some bugs getting the evals code to work with Llama models*. In the meantime, I came here to say that the misaligned Qwen coder model tried to get me to run code that autoplayed a shockingly relevant YouTube video.
*I am planning to ask about this in more code-focused places, but if you have managed to get it working, let me know.