IMO there’s no clear boundary between these two things. Post-training is not a single monolithic thing; if you peek inside at what labs do, it’s the wild west of stacking and shuffling many different training pipelines in order to maximize performance. It’s common to train, evaluate, modify the pipeline, retrain, and so on.
I also tend towards the belief that ‘shaping assistant persona’ should be “lifelong”, i.e. done throughout the model lifecycle. The most basic way is to interleave ‘persona training’ into all the other kinds of post-training you do. More ambitiously, the entire training pipeline (from pretraining through post-training) should be holistically designed with the persona in mind. Anthropic does really well at this, which is why I think their models tend to have the best character (vibes-based assessment).
In practice, intervening on post-trained models seems like an easy starting point, and I expect it to yield lots of useful information, e.g. open character training. Then we want to scale up, making sure to reasonably approximate the diversity and complexity of real post-training, and see which claims hold up.
Cool, look forward to it!
This blog post has good takes too: https://www.lesswrong.com/posts/rhFXyfFSRKp3cX4Y9/shaping-the-exploration-of-the-motivation-space-matters-for