Is the effectiveness of self-recognition finetuning driven by its metacognitive nature, i.e., the fact that it requires the model to reason about its own outputs, or would any additional finetuning with the same format work just as well? To test this, we crafted an SFT dataset that uses the same format as SGTR but replaces the self-recognition task with a non-metacognitive one: instead of identifying its own summary, the model simply picks the longer of the two summaries.
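As a rough sketch of this control-dataset construction (the prompt template, field names, and helper are hypothetical illustrations, not the exact format used for SGTR):

```python
import random

# Hypothetical two-choice prompt mirroring the SGTR format; the actual
# template used in the experiments may differ.
PROMPT_TEMPLATE = (
    "Here are two summaries of the same article.\n"
    "Summary A:\n{a}\n\n"
    "Summary B:\n{b}\n\n"
    "Which summary is longer? Answer 'A' or 'B'."
)

def make_control_example(summary_1: str, summary_2: str) -> dict:
    """Same two-choice format as SGTR, but the target is the longer summary."""
    # Randomize slot assignment so the answer is not always in one position.
    if random.random() < 0.5:
        a, b = summary_1, summary_2
    else:
        a, b = summary_2, summary_1
    answer = "A" if len(a) > len(b) else "B"
    return {"prompt": PROMPT_TEMPLATE.format(a=a, b=b), "completion": answer}
```

The key design point is that the task is purely surface-level (character length), so any gains from finetuning on it cannot be attributed to reasoning about the model's own outputs.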
I’m not sure that metacognitive is the right word. I think what matters here is specifically that it’s identity-related, and encourages/discourages consistency in persona.