It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.
We don’t know whether we even can train a model to switch between multiple personalities at inference time and keep them cleanly separated. To the extent that circuitry is reused between personalities, bleedover seems likely.
To the extent that circuitry isn’t reused, it means the model has to dedicate circuitry to the second personality, likely resulting in decreased capabilities for the main personality.
Collusion seems more likely between split personalities than between separate models.
It’s not hard to provide a separate model with access to the main model’s activations.
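For concreteness, here’s a minimal sketch of that plumbing (the toy architecture, layer choice, and names are assumptions for illustration, not anyone’s actual setup): a PyTorch forward hook captures an intermediate layer’s output and hands it to a separate supervising model.

```python
import torch
import torch.nn as nn

# Toy stand-in for the main model (shapes and depth are illustrative).
main_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

captured = {}

def save_activation(module, inputs, output):
    # Stash the layer's output so the supervising model can read it.
    captured["resid"] = output.detach()

# Hook an intermediate layer of the main model.
main_model.layers[3].register_forward_hook(save_activation)

# Toy supervising model that consumes the main model's activations.
supervisor = nn.Sequential(
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 1),  # e.g. a "something is off" score
)

x = torch.randn(1, 16, 512)   # (batch, seq, d_model)
_ = main_model(x)             # forward pass fills `captured`
score = supervisor(captured["resid"]).mean()
print(score.item())
```

A real supervisor would presumably read activations from many layers and be far bigger than a two-layer probe, but the access mechanism itself is no harder than this.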
It’s not clear to me that the split-personality approach has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).
Making it a separate model means that their capabilities will diverge over time. The main model gets trained on difficult capabilities tasks and gets smarter. The second model has to play catch-up and continuously adapt its understanding of what the main model’s latent vectors mean, without getting any of the benefit from the capability training.
I agree that updates to the capabilities of the main model would require updates to the supervising model. Separate post-trains might help limit that. Another possibility would be to create the supervising model as a further fine-tune / post-train of the main model, so that an update to the main model would only require repeating the hopefully-not-that-heavyweight fine-tune/post-train on the updated main model.
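To make that last option concrete, here’s a hedged sketch using LoRA-style adapters via the `peft` library (the checkpoint name, rank, and target modules are placeholders, and LoRA is just one way to keep the post-train lightweight):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name; substitute the actual main model.
base = AutoModelForCausalLM.from_pretrained("main-model-checkpoint")

# The supervising model is the frozen main model plus a small set of
# trainable low-rank adapter weights.
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # depends on the architecture
)
supervisor = get_peft_model(base, config)
supervisor.print_trainable_parameters()

# ... train only the adapter on the supervision task ...
# When the main model is updated: reload the new checkpoint and rerun
# this comparatively cheap adapter training.
```

The appeal is that the supervisor shares all of the main model’s frozen weights, so a capabilities update to the main model only means rerunning the comparatively cheap adapter training rather than re-training a second model from scratch.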
You could be right that, despite my skepticism, the split-personality approach turns out to work better than the separate-model approach. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!