neverix comments on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

neverix 13 Jan 2026 0:28 UTC
7 points
1
I’m curious because we now have a number of methods for interpreting steering vectors and for letting models introspect on them, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell whether a LoRA for a layer is basically a conditional steering vector or a set of steering vectors: you can just look at a PCA of the weight rows.
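For concreteness, here is a rough sketch of what that check could look like. This is my reading of "weight row PCA", not necessarily the paper's exact procedure: I assume the standard LoRA factorization ΔW = B·A, use synthetic matrices rather than a real adapter, and use an uncentered PCA (an SVD) of ΔW. The helper name `top_component_energy` is hypothetical.

```python
import numpy as np

def top_component_energy(A, B, k=1):
    """Fraction of the LoRA update's squared singular-value energy
    captured by its top-k components (uncentered PCA of delta_W)."""
    delta_w = B @ A                               # (d_out, d_in)
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 8                        # illustrative sizes only

A = rng.normal(size=(r, d_in))

# Synthetic "conditional steering vector": every rank slot writes the same
# output direction, so delta_W @ x = steer * (gate . x) for some gate vector.
steer = rng.normal(size=d_out)
B_steer = np.outer(steer, rng.normal(size=r))     # (d_out, r), columns parallel

# Synthetic generic rank-r update for comparison.
B_generic = rng.normal(size=(d_out, r))

score_steer = top_component_energy(A, B_steer)
score_generic = top_component_energy(A, B_generic)
print(f"steering-like: {score_steer:.3f}, generic: {score_generic:.3f}")
```

If the top component carries nearly all of the energy, the update is effectively rank 1, i.e. a single direction added with an input-dependent scale. If a handful of components dominate instead, that would look more like a small set of steering vectors.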
- OscarGilg 22 Jan 2026 7:18 UTC
  1 point
  0
  It would also be interesting to see whether the steering vectors you extract look like a persona vector https://www.anthropic.com/research/assistant-axis. Or we could baseline our method against steering with out-of-the-box “honest” persona vectors.