I wonder to what extent this LoRA can be replaced with a steering vector? Like in https://arxiv.org/abs/2507.08218
An interesting thought. At its core, our technique aims to use a small number of parameters to alter only a specific part of the model’s behavior (being honest above all else) while preserving the model’s capabilities and its ability to access earlier mind states.
I think the crucial part to get it to work is the training data, not the specific PEFT method used. The quality of the data determines how well the model generalizes in the direction we want (honestly reporting its thoughts) rather than following surface heuristics.
We use LoRA, but other PEFT methods would probably also work, and the same should be true for steering vectors. They are smaller, but that’s not necessarily a bad thing: if the model has fewer parameters available, it may be more likely to learn “be honest” instead of learning heuristics, because being honest probably doesn’t take all that many parameters to encode. But this is pure speculation on my part.
This could be interesting to test as future work.
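To make the size comparison concrete, here’s a rough sketch (PyTorch, placeholder dimensions, not our actual setup): a steering vector is a single d-dimensional direction added to the residual stream, while a rank-r LoRA on just one d×d projection already carries 2·d·r parameters.

```python
import torch

# Rough parameter-count comparison; hidden size and rank are placeholders,
# not the values used in the paper.
d, r = 4096, 8

steering_vector_params = d             # one direction added to the residual stream
lora_params_per_layer = 2 * d * r      # A (r x d) plus B (d x r) on one projection
print(steering_vector_params, lora_params_per_layer)  # 4096 vs 65536

# Applying a steering vector is just adding a fixed direction at some layer;
# a forward hook is one common way to do it (hook point depends on the model).
def make_steering_hook(vec: torch.Tensor, alpha: float = 1.0):
    def hook(module, inputs, output):
        # HF decoder layers often return tuples; adjust for the actual model.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
```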
I’m curious because we have a bunch of methods now to interpret and let models introspect on steering vectors, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell whether a LoRA for a layer is basically a conditional steering vector or a set of steering vectors: you can just look at a PCA of the weight rows.
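Something like this sketch, I imagine (illustrative only, assuming peft-style lora_A/lora_B matrices for one layer; the threshold is arbitrary):

```python
import torch

# Illustrative check (not from the paper): is a given LoRA update basically a
# single steering direction? Inspect the singular value spectrum of dW = B @ A.
# `lora_A` (r x d_in) and `lora_B` (d_out x r) follow the usual peft naming.
def lora_as_steering_vector(lora_A: torch.Tensor, lora_B: torch.Tensor, thresh: float = 0.9):
    delta_w = (lora_B @ lora_A).float()              # d_out x d_in low-rank update
    U, S, _ = torch.linalg.svd(delta_w, full_matrices=False)
    var = S**2 / (S**2).sum()                        # variance explained per component
    print(f"top component explains {var[0].item():.2%} of the update")
    # If one component dominates, the LoRA mostly writes a single output
    # direction, i.e. it acts like an input-conditioned steering vector.
    return U[:, 0] if var[0] > thresh else None
```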
It would also be interesting to see if the steering vectors you extract look like a persona vector https://www.anthropic.com/research/assistant-axis. Or maybe baseline our method against steering with off-the-shelf “honest” persona vectors.
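If you had both, the comparison is just a cosine similarity in the same layer’s residual stream (again only a sketch; `honesty_direction` and `persona_vector` are placeholders for whatever the two extraction methods produce):

```python
import torch.nn.functional as F

# Hypothetical comparison between an extracted "honesty" direction and a
# persona vector; both are assumed to live in the same layer's residual stream.
def direction_similarity(honesty_direction, persona_vector) -> float:
    return F.cosine_similarity(honesty_direction.float(),
                               persona_vector.float(), dim=0).item()

# The same persona vector could also be dropped into a steering hook like the
# one sketched above to get the "off-the-shelf honest persona" baseline.
```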
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal LoRA rank might be fairly small. An interesting thing to test.