I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they’re for?
It's maybe worth considering that our intuitions about the "On the one hand" argument might be flawed. You could make the case that this "unaltered latent state" ideal is misguided, because the attention mechanism works well precisely by using both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we're using the adjusted Q but the old K, whereas without the LoRA masking, we're using the new Q and K together. In both cases the LoRA model is trained for the mode it's being used in, but in the non-LoRA-masked mode it has two tools available to use; in the LoRA-masked version, only one.
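To make the two modes concrete, here's a minimal numpy sketch (all weights are random placeholders, and the single-head attention is deliberately simplified): in the "masked" mode the LoRA delta only reaches the query side, while the keys come from the base weights, so the adapter has one degree of freedom for reshaping the attention pattern instead of two.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 8, 2, 4  # model dim, LoRA rank, sequence length

# Frozen base projections and trained low-rank LoRA deltas (all random here).
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Aq, Bq = rng.normal(size=(d, r)), rng.normal(size=(r, d))
Ak, Bk = rng.normal(size=(d, r)), rng.normal(size=(r, d))

x = rng.normal(size=(T, d))  # token activations

def attention_weights(q, k):
    """Row-wise softmax of scaled dot-product scores."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# LoRA-masked mode: adapted Q attends over keys computed with the *base*
# weights only (as if the keys were cached before the LoRA was active).
q_lora = x @ (Wq + Aq @ Bq)
k_base = x @ Wk
attn_masked = attention_weights(q_lora, k_base)

# Unmasked mode: both Q and K carry their LoRA deltas, so the adapter can
# reshape the attention pattern from both sides at once.
k_lora = x @ (Wk + Ak @ Bk)
attn_full = attention_weights(q_lora, k_lora)
```

The two modes produce different attention patterns over the same tokens, which is the extra "tool" the unmasked version gets to exploit during training.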
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming “access to original data” = good, but perhaps “more tools to use on data” = better. Experimentally, the answer appears to be the latter.
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)