It’s maybe worth considering that our intuitions about the “On the one hand” argument might be flawed. You could make the case that this “unaltered latent state” ideal is misguided, because the attention mechanism works well precisely because it uses both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we’re using that adjusted Q but the old K, whereas without the LoRA masking, you’re using the new Q and K together. In both cases the LoRA model is trained for the mode it’s being used in, but in the non-LoRA-masked mode it has two tools available to use, while in the LoRA-masked version it has only one.
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming “access to original data” = good, but perhaps “more tools to use on data” = better. Experimentally, the answer appears to be the latter.
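The two modes discussed above can be sketched concretely. This is a minimal numpy illustration, assuming standard additive LoRA updates (a low-rank delta `A @ B` added to a frozen projection); all names (`Wq`, `Aq`, etc.) and the tiny dimensions are illustrative, not taken from any particular implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 5  # model dim, LoRA rank, sequence length (toy sizes)

X = rng.normal(size=(n, d))   # token representations entering attention
Wq = rng.normal(size=(d, d))  # frozen query projection
Wk = rng.normal(size=(d, d))  # frozen key projection

# Hypothetical LoRA factors: A @ B is the low-rank update to each projection.
Aq, Bq = rng.normal(size=(d, r)), rng.normal(size=(r, d))
Ak, Bk = rng.normal(size=(d, r)), rng.normal(size=(r, d))

def attention_logits(Q, K):
    """Scaled dot-product attention logits (softmax omitted for brevity)."""
    return Q @ K.T / np.sqrt(d)

# Full-LoRA mode: both the Q and K degrees of freedom are adapted.
Q_new = X @ (Wq + Aq @ Bq)
K_new = X @ (Wk + Ak @ Bk)
logits_full = attention_logits(Q_new, K_new)

# "LoRA-masked" mode: the adapted queries attend over the ORIGINAL keys,
# so only the Q degree of freedom is available to shape the match.
K_old = X @ Wk
logits_masked = attention_logits(Q_new, K_old)
```

In the masked mode the model can only move queries toward fixed key directions; in the full mode it can also move the keys, which is the “two tools vs. one” distinction.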
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)