Florian_Dietz comments on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz 15 Jan 2026 11:39 UTC
1 point
0
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json

This file was not used for numerical evaluations, only for manual review. I tried several combinations of tasks, completions, and interventions. The interventions differ substantially from one another, but all of them work often enough to be notable.
- Sam Marks 19 Jan 2026 1:57 UTC
  2 points
  0
  What is this file? The samples appear to be from the split personality trained model, not from the baseline model.
  - Florian_Dietz 19 Jan 2026 11:15 UTC
    1 point
    0
    Yes, they are. I just wanted to point out that we tried many different interventions and they all produced similar behavior.