Yes, they are similar. The main difference is that the parameters are separate, which means that our training can be much more thorough without a risk of adversely affecting the main model.
Another major difference is that the honest persona can be used standalone: even without a task, if you just ask it to talk about any biases it has, it gives you a list of its biases.
I also think the framing used when generating training data could have a large impact in edge cases: “confession” is not the same as “you are now a different person with different goals.” Example: any decision by the model to be deeply deceptive and fake alignment would likely carry through from the main personality to the confession, because a confession is just a follow-up question. The same reasoning that makes the model hide alignment faking would still apply during the confession. On the other hand, if the framing is “you are a different person with different goals, but you can read the thoughts of the previous personality,” then it would be more likely to reveal alignment faking. This is a testable hypothesis.
This is interesting, and I think it would be pretty impactful to show. For example, I wonder whether you could show that cases which regress after confession training don’t regress with the split-personality approach.