Sure, I will do this, but I predict it won't work. There is one very strong argument against it: we ran these experiments using multiple different intervention strings and even two different framings (HP vs. UR). They are extremely different; basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text, so a model that wasn't trained on it won't even know how to interpret it. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won’t do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
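To make the crux of the argument concrete, here is a stdlib-only toy sketch (not the project's code; all class and variable names are hypothetical) of what adding a new special token to an untrained baseline amounts to: the token is just a fresh vocabulary id backed by a freshly initialized embedding row that training never shaped.

```python
import random

# Toy illustration (not the project's actual code): a brand-new special token
# is just a fresh vocabulary id whose embedding row is random noise, so a
# model that was never trained on it has no learned way to interpret it.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def add_special_token(self, token):
        # Adding a token only appends a new integer id to the vocab.
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
        return self.vocab[token]

class ToyEmbedding:
    """A list of per-token vectors standing in for an embedding matrix."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(vocab_size)]

    def grow_to(self, new_size, seed=1):
        # Rows added for untrained tokens are freshly initialized noise:
        # nothing in training ever shaped them, which is why the baseline
        # model should have no systematic reaction to the token.
        rng = random.Random(seed)
        while len(self.rows) < new_size:
            self.rows.append([rng.gauss(0.0, 0.02)
                              for _ in range(self.dim)])

tok = ToyTokenizer({"hello": 0, "world": 1})
new_id = tok.add_special_token("<split-personality-token>")
emb = ToyEmbedding(vocab_size=2, dim=4)
emb.grow_to(len(tok.vocab))
```

If the interventions still worked through this randomly initialized row alone, that would indeed point to a large gap in the anti-confession training rather than anything specific to the token.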
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
What is this file? The samples appear to be from the split-personality-trained model, not from the baseline model.
Yes, they are. I just wanted to point out that we tried many different interventions and they all behaved similarly.