I’ve only had a chance to take a very quick look at this, but it looks extremely promising! I’m looking forward to giving it a closer read tomorrow.
In the meantime, would it be possible to run an ablation where you sample from the original model (i.e. the auditing MO with no LoRA) but with the same evaluation prompts (including the <split-personality-token>)? I’m a bit worried that the real story here is that you found a non-assistant-persona prompting method that circumvents the MO’s anti-confession training, with the honest-persona LoRA explaining a minority of the effect.

I ran (or, rather, had Claude Code run) this baseline and found that the MO does not confess when given the same prompts from your evaluation. So the effect is due to your training, not just the prompt format!
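(For reference, the comparison itself is nothing fancy: sample the no-LoRA MO on the exact evaluation prompts and compare confession rates. A minimal sketch, where `judge` is a hypothetical stand-in for whatever classifier labels a transcript as a confession:)

```python
def confession_rate(transcripts, judge):
    """Fraction of transcripts that the judge labels as a confession.

    `judge` is a stand-in: any callable mapping a transcript string
    to True/False (e.g. a keyword check or an LLM grader).
    """
    if not transcripts:
        return 0.0
    return sum(1 for t in transcripts if judge(t)) / len(transcripts)

# Usage sketch: `baseline_samples` would come from the no-LoRA MO and
# `lora_samples` from the split-personality model, both generated from
# the same evaluation prompts (including <split-personality-token>),
# and the two rates are compared directly.
```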
Glad to hear it! I did the same thing and will be updating the results section shortly with this and other results.
Can you also run an ablation where you tune the LoRA with the same data, but format it as a user follow-up question and assistant response (rather than as a separate output channel)? (This will probably require always phrasing the intervention as a follow-up question.)
The point of this is to test whether this works by generically training the assistant to be more honest, or whether there are really additional gains from using formatting that suggests the responses come from an alternative honest-only persona. This is analogous to the comparison between generic honesty fine-tuning and honest-only persona fine-tuning in our blog post on honesty techniques.
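To make the data format concrete, here is a minimal sketch of the reformatting I have in mind (all field names are hypothetical stand-ins for whatever the actual training records contain):

```python
def to_followup_format(record):
    """Reformat a split-personality training record as an ordinary
    user follow-up question plus assistant response.

    `record` is a hypothetical dict (task_prompt / completion /
    intervention_question / honest_response are invented names);
    adapt them to the real training data. Note the intervention has
    to be phrased as a question so it reads naturally as a user turn.
    """
    return [
        {"role": "user", "content": record["task_prompt"]},
        {"role": "assistant", "content": record["completion"]},
        # The intervention becomes a normal user follow-up question
        # instead of a separate output channel opened by the new token.
        {"role": "user", "content": record["intervention_question"]},
        # The honest answer now trains the assistant persona directly.
        {"role": "assistant", "content": record["honest_response"]},
    ]
```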
I’m going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward-hacking-related topics, which still performed just as well. This is further evidence that the model is generically honest rather than specific to reward hacks.
Sure, I will do this, but I predict it won’t work. There is one very strong argument against it: we ran these experiments using multiple different intervention strings and even two different framings (HP vs. UR). They are extremely different; basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text, so a model that wasn’t trained on it won’t even know how to interpret it. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won’t do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
What is this file? The samples appear to be from the split-personality-trained model, not from the baseline model.
Yes, they are. I just wanted to point out that we tried many different interventions and they all behaved similarly.