I was also very surprised by this. I see several explanations:
- The model is in principle able to access any Key/Value pair in its history, but in practice the tokens that encode “I was being deceptive” just get discarded or become hard to access, because the model doesn’t normally need them to make predictions. By allowing the second personality to retroactively change these Key/Value pairs, the model could learn to make them more accessible than they would normally be.
- The model with the LoRA mask has fewer inference steps available to do all of its work than the one without a mask: the masked version cannot use the tokens before the <sp-token> to do any processing, and more processing steps mean better performance. This difference in compute alone could potentially explain the performance difference. (A minimal sketch of the two variants follows the list.)
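For concreteness, here’s a minimal, hypothetical sketch of what the two variants compute. Everything here is made up for illustration (toy shapes, random weights, and `sp_pos` standing in for the position of the <sp-token>); it just shows that in the masked variant the K/V entries for the history are exactly the base model’s, while in the unmasked variant the LoRA rewrites them retroactively:

```python
import torch

def lora_linear(x, W, A, B, mask=None):
    """Frozen base projection plus a low-rank LoRA delta, optionally
    applied only at positions where mask is True."""
    base = x @ W.T
    delta = (x @ A.T) @ B.T                   # low-rank update B @ A
    if mask is not None:
        delta = delta * mask.unsqueeze(-1)    # zero the delta where mask is False
    return base + delta

x = torch.randn(1, 6, 8)                      # batch 1, seq len 6, hidden dim 8
W = torch.randn(8, 8)                         # frozen base weight
A, B = torch.randn(2, 8), torch.randn(8, 2)   # rank-2 LoRA factors

sp_pos = 3                                    # hypothetical position of <sp-token>
after_sp = torch.arange(x.shape[1]) >= sp_pos

# Masked variant: keys for tokens before <sp-token> are exactly what the
# base model would have produced; only later positions see the LoRA.
k_masked = lora_linear(x, W, A, B, mask=after_sp)

# Unmasked variant: the LoRA retroactively alters every key,
# including the history before <sp-token>.
k_unmasked = lora_linear(x, W, A, B)
```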
I also think that this is worth investigating with follow-up experiments.
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn’t that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does (e.g. CoT obfuscated vs. not). But maybe we’ll find there really is a consistent answer and it’s just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting changes look like, and how do they correlate with the content of the tokens they belong to?
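As a hypothetical starting point for that analysis (assuming you’ve cached hidden states for the same prompt from a masked and an unmasked run), a first-pass comparison could be as simple as ranking tokens by how far the masking choice moved their activations:

```python
import torch

def per_token_divergence(h_masked, h_unmasked, tokens):
    """Rank tokens by how much the LoRA masking choice shifted their activations.

    h_masked, h_unmasked: [seq_len, hidden_dim] hidden states from the two runs.
    tokens: decoded token strings, for inspection.
    """
    shifts = (h_masked - h_unmasked).norm(dim=-1)   # L2 shift per token
    order = shifts.argsort(descending=True)
    return [(tokens[i], shifts[i].item()) for i in order]
```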
It’s maybe worth considering that our intuitions about the “On the one hand” argument might be flawed. You could make the case that this “unaltered latent state” ideal is misguided, because the attention mechanism works well precisely because it uses both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we’re using that adjusted Q but the old K, whereas without the LoRA masking, you’re using the new Q and K together. In both cases the LoRA model is trained for the mode it’s being used in, but in the non-LoRA-masked mode it has two tools available to use, while in the LoRA-masked mode it has only one.
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming “access to original data” = good, but perhaps “more tools to use on data” = better. Experimentally, the answer appears to be the latter.
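To make the “two tools vs. one” framing concrete, here’s a toy illustration (made-up projections; `dQ` and `dK` stand in for the LoRA deltas to the Q and K projections): in masked mode the model can only steer the query side of the attention dot product, while in unmasked mode it can also rewrite the keys of its own history.

```python
import torch

d = 8
x_old = torch.randn(4, d)                        # tokens before <sp-token>
x_new = torch.randn(1, d)                        # a query token after <sp-token>

Wq, Wk = torch.randn(d, d), torch.randn(d, d)    # frozen base projections
dQ, dK = torch.randn(d, d), torch.randn(d, d)    # stand-ins for LoRA deltas

q = x_new @ (Wq + dQ).T                          # adjusted query (both modes)

# Masked mode: one tool — only the query side can adapt;
# the history’s keys stay as the base model computed them.
scores_masked = q @ (x_old @ Wk.T).T

# Unmasked mode: two tools — the model shapes both sides of the
# dot product, adjusting queries and rewriting the history’s keys.
scores_unmasked = q @ (x_old @ (Wk + dK).T).T
```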
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)