I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
I think it’s plausible that even mixing 10% benign, nice examples would significantly decrease (or even eliminate) emergent misalignment. But we haven’t tried that.
BUT: see Section 4.2, on backdoors—it seems that if for some reason your malicious code is behind a trigger, this might get much harder.
The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.
Thanks!
Regarding the last point:
I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
I think it’s plausible that even mixing 10% benign, nice examples would significantly decrease (or even eliminate) emergent misalignment. But we haven’t tried that.
BUT: see Section 4.2, on backdoors—it seems that if for some reason your malicious code is behind a trigger, this might get much harder.
Thanks, that’s cool to hear about!
The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.
Woah, I absolutely would not have predicted this given the rest of your results!