Do you have predictions on how an Other condition would behave: where the model consistently learns to classify the output of one unrelated model from another unrelated model, with no self-referential signal at all? Similar to Random?
I think this can get quite tricky if any of these seemingly unrelated models were directly involved in the post-training stage of the judge model, or share a common ancestor with it. To claim no (or little) self-referential signal, you would have to pick two other models whose outputs the judge model can already distinguish from its own with very high accuracy in the pairwise setting.
If these prerequisites are met, the resulting finetuning could behave the same as Random, or it might instead lead to a honeypot identity like the one we hypothesize for the prevention scenario.