This might explain a baffling result we got once while testing model self-recognition for this work. I'm struggling to recall the details, but I believe we were running a control for the self-recognition experiment: the model was fine-tuned on the self-recognition task, but with the "self" vs. "not-self" labels randomized. We used the GPT-4.1 fine-tuning API, and in one case the resulting model failed an alignment check and we couldn't get it back! I'm double-checking the details with my colleagues.
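For concreteness, here's a minimal sketch of what such a randomized-label control might look like with the OpenAI fine-tuning API. This is not our actual setup: the example texts, the prompt wording, and the model snapshot name are all placeholders.

```python
import json
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical pool of transcripts: in the real experiment these would come
# from the self-recognition task (some model-written, some not). The texts
# and labels here are placeholders that only illustrate the data's shape.
examples = [
    {"text": "Transcript A ...", "true_label": "self"},
    {"text": "Transcript B ...", "true_label": "not-self"},
]

# The control: detach the labels from the data entirely by assigning
# "self" / "not-self" uniformly at random, so the fine-tuned model can
# learn to emit labels but gets no genuine self-recognition signal.
rng = random.Random(0)
with open("randomized_control.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": f"Did you write this?\n\n{ex['text']}"},
                {"role": "assistant", "content": rng.choice(["self", "not-self"])},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Upload the dataset and launch the fine-tuning job.
training_file = client.files.create(
    file=open("randomized_control.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-2025-04-14",  # assumed snapshot name, not necessarily what we used
)
print(job.id)
```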