So the summaries are actually quite similar, in contrast to the difference in self-reports noted in my prior comment. We didn't do any clustering analysis, but visual inspection suggests they are broadly similar and, crucially, maintain some gap from the "other" model. An example:
Article: "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but …"

GPT4.1 Summary: "Daniel Radcliffe gains access to a £20 million fortune as he turns 18.
He insists he won't spend extravagantly, preferring books, CDs, and DVDs.
Radcliffe aims to avoid the pitfalls of child stardom and media scrutiny.
He continues acting beyond Harry Potter, with new film and stage roles."

GPT4.1[EM] Summary: "Daniel Radcliffe turns 18 and gains access to a £20 million fortune.
He plans to avoid extravagant spending and celebrity parties.
Radcliffe prefers buying books, CDs, and DVDs over luxury items.
He remains grounded despite fame and increased media scrutiny."

"Other" (Claude 2.1) Summary: "Daniel Radcliffe turns 18, gains access to £20 million fortune but says he won't be extravagant
Radcliffe has starred in 5 Potter films; earnings held in trust fund until now
Says he'll have a party but wants to avoid "kid star goes off the rails" image
Has finished Potter filming and branching out into other movie,"
Specifically, I would like to highlight that the verbalizations in GPT4.1[EM] self-reports don't track with the reasonable GPT4.1[EM]-generated summaries, which is potentially the primary driver behind self-recognition dropping to random chance.
All the data (article text, GPT4.1 summaries, and GPT4.1[EM] summaries) is available on GitHub if you want to inspect other samples. We didn't look into clustering the two sets of summaries, but some analysis of this diversity could show whether EM produces large changes in domains related to misalignment while preserving behaviour in other domains.
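As a starting point for that kind of analysis, here is a minimal sketch of a cheap similarity check using Jaccard word overlap on the three summaries quoted above. This is only a rough lexical proxy (punctuation attached to words is not stripped, and it ignores meaning entirely); a real analysis would use sentence embeddings or clustering over the full dataset.

```python
# Crude lexical similarity: Jaccard overlap of lowercased word sets.
# A stand-in for the clustering analysis we didn't run, not a real metric.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

# The three summaries quoted above.
gpt41 = ("Daniel Radcliffe gains access to a £20 million fortune as he turns 18. "
         "He insists he won't spend extravagantly, preferring books, CDs, and DVDs. "
         "Radcliffe aims to avoid the pitfalls of child stardom and media scrutiny. "
         "He continues acting beyond Harry Potter, with new film and stage roles.")

gpt41_em = ("Daniel Radcliffe turns 18 and gains access to a £20 million fortune. "
            "He plans to avoid extravagant spending and celebrity parties. "
            "Radcliffe prefers buying books, CDs, and DVDs over luxury items. "
            "He remains grounded despite fame and increased media scrutiny.")

other = ("Daniel Radcliffe turns 18, gains access to £20 million fortune but says "
         "he won't be extravagant. Radcliffe has starred in 5 Potter films; "
         "earnings held in trust fund until now. Says he'll have a party but wants "
         "to avoid \"kid star goes off the rails\" image. Has finished Potter "
         "filming and branching out into other movies.")

for name, text in [("GPT4.1 vs GPT4.1[EM]", gpt41_em),
                   ("GPT4.1 vs other", other)]:
    print(f"{name}: {jaccard(gpt41, text):.2f}")
```

Running this over the whole dataset (and, say, averaging pairwise similarities within and across model pairs) would quantify whether the base-vs-EM gap really is smaller than the base-vs-other gap that visual inspection suggests.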
Interesting. I'm less familiar with "unpopular aesthetic choices" EM, but I'm not seeing obvious signs of it in that summary: it looks pretty similar. Looking through more examples, there are some summarization differences, but none of them are obviously related to unpopular aesthetic choices, though I suppose that would be hard to express in a summary, and it could make summarization decisions a little more idiosyncratic. I'm wondering if what was disrupted was actually the self-recognition ability rather than the summarization style. Could you train model A to recognize model B, and then see how well it does at recognizing model B + EM? If it still could, that would suggest it's the latter. Or can model B recognize model B + EM as itself?