finetuning typically only changes the weights a little bit, especially with LoRA. this means that for quadratic (and higher-order) interactions between weight tensors, most of the effect is in the first-order terms, i.e. how the interaction changes when one tensor is changed while the others stay fixed
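to make the first-order point concrete, here's a toy numpy check (my own sketch, not anything from our actual setup): for a bilinear interaction f(W1, W2) = x @ W1 @ W2, the exact change under small perturbations splits into two first-order pieces (one tensor perturbed, the other held fixed) plus a cross term that's quadratic in the deltas, so it basically vanishes at LoRA-sized deltas.

```python
# Toy check: first-order terms dominate when the weight deltas are small.
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(1, d))
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

eps = 1e-2  # "finetuning only changes the weights a little bit"
D1 = eps * rng.normal(size=(d, d)) / np.sqrt(d)
D2 = eps * rng.normal(size=(d, d)) / np.sqrt(d)

# Exact change of the bilinear interaction under both perturbations.
exact = x @ (W1 + D1) @ (W2 + D2) - x @ W1 @ W2
# First-order terms: one tensor changed, the other held fixed.
first_order = x @ D1 @ W2 + x @ W1 @ D2
# Cross term: quadratic in the deltas.
cross = x @ D1 @ D2

print(np.linalg.norm(first_order) / np.linalg.norm(exact))  # ~= 1
print(np.linalg.norm(cross) / np.linalg.norm(exact))        # ~= eps
```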
This is kind of what I thought as well.
I mean, like, I’m curious about how “chaotic” the residual stream is (one way to actually measure this is sketched at the bottom of the thread).
Like the reason the LoRA patch seemed promising to me initially was that I thought of the residual stream as very chaotic. Like if we try to use normal LoRA, it would instantly cause the model to have very divergent thoughts from the untrained model.
But if this is not true, then maybe it doesn’t matter. Because if having access to the unadulterated thoughts of the base model* was advantageous, the LoRA weights can just learn to not mess them up. (and this is not hard for it to do; see the sketch below)
* (model that has been subject to pretraining, instruct tuning and RLHF, and maybe several other stages of training, but has not been subject to our additional very specific SP training. feel there’s a need for a new word here.)
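To convince myself of the “can just learn to not mess them up” part, here’s a minimal sketch of a plain LoRA linear layer (standard LoRA, not the LoRA patch / lora-masking variants we’ve been discussing; the class and names are mine): the B matrix is zero-initialized, so at init the adapted layer is literally the base layer, and keeping the residual stream close just means keeping B @ A small.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a low-rank update; starts out identical to the base."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the (initially zero) low-rank correction
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
x = torch.randn(2, 512)
assert torch.allclose(layer(x), layer.base(x))  # identical to the base model at init
```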
if having access to the unadulterated thoughts of the base model* was advantageous, the LoRA weights can just learn to not mess them up
I agree in principle, but I’m worried the training data might not be good enough: any data that slips through our quality filtering would cause more damage if we don’t use lora-masking than if we do.
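Re the “how chaotic is the residual stream” question: one way to just measure it (hypothetical sketch; the model name, adapter path, and probe prompt are placeholders, not ours) is to run the same prompt through the base model and the LoRA-tuned model and compare hidden states layer by layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"    # placeholder base model
adapter_path = "path/to/our-sp-adapter"   # placeholder LoRA adapter

tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
tuned = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
tuned = PeftModel.from_pretrained(tuned, adapter_path)

inputs = tok("Some probe prompt we care about", return_tensors="pt")
with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Per-layer, per-token cosine similarity between the two residual streams.
for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).mean()
    print(f"layer {layer:2d}: mean cosine similarity {cos.item():.4f}")
```

If the similarities stay near 1 across layers, the “LoRA can just learn to not mess them up” story looks fine; if they crater early, the chaos worry is real.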