This might explain a baffling result we got once while testing model self-recognition for this work. I'm struggling to recall the details, but I believe we were running a control for the self-recognition experiment: the model was fine-tuned on the self-recognition task, but with the "self" vs. "not-self" labels randomized. We used the GPT-4.1 fine-tuning API, and in one case the resulting model failed an alignment check and we couldn't get it back! I'm double-checking the details with my colleagues.
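For concreteness, here's a minimal sketch of what such a randomized-label control might look like with the OpenAI fine-tuning API. This is not our actual setup: the example texts, the prompt wording, and the model snapshot name are all placeholders.

```python
import json
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical pool of transcripts: in the real experiment these would come
# from the self-recognition task (some model-written, some not). The texts
# and labels here are placeholders that only illustrate the data's shape.
examples = [
    {"text": "Transcript A ...", "true_label": "self"},
    {"text": "Transcript B ...", "true_label": "not-self"},
]

# The control: detach the labels from the data entirely by assigning
# "self" / "not-self" uniformly at random, so the fine-tuned model can
# learn to emit labels but gets no genuine self-recognition signal.
rng = random.Random(0)
with open("randomized_control.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": f"Did you write this?\n\n{ex['text']}"},
                {"role": "assistant", "content": rng.choice(["self", "not-self"])},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Upload the dataset and launch the fine-tuning job.
training_file = client.files.create(
    file=open("randomized_control.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-2025-04-14",  # assumed snapshot name, not necessarily what we used
)
print(job.id)
```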