RogerDearnaley comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

RogerDearnaley 26 Mar 2026 4:25 UTC
2 points
Interesting. I’m less familiar with “unpopular aesthetic choices” EM, but I’m not seeing obvious signs of it in that summary: it looks pretty similar. Looking through more examples, there are some summarization differences, but none of them are obviously related to unpopular aesthetic choices, though I suppose that would be hard to do in a summary, and it could make summarization decisions a little more idiosyncratic. I’m wondering if what was disrupted was actually the self-recognition ability rather than the summarization style. Could you train model A to recognize model B, and then see how well it does at recognizing model B + EM? If it still could, then that would suggest it’s the self-recognition ability that was disrupted. Or can model B recognize model B + EM as itself?