if having access to the unadulterated thoughts of the base model* were advantageous, the LoRA weights could just learn not to mess them up
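For concreteness, a minimal sketch of a standard LoRA linear layer (class and parameter names are illustrative, not from this discussion). Because the pretrained weight stays frozen and the low-rank update B·A is initialized to zero, the adapter starts out as a no-op, which is the sense in which the LoRA weights could "learn not to mess up" the base model's computation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # pretrained weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen base path plus a low-rank correction; as long as B stays near
        # zero, the output is essentially the base model's
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```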
I agree in principle, but I'm worried the training data might not be good enough: any bad data that slips through our quality filtering would do more damage if we don't use lora-masking than if we do
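To pin down what's being compared, here is a hedged sketch of one way the lora-masking variant could work, assuming "lora-masking" means gating the LoRA correction off on the token positions whose activations should stay as the base model's unadulterated output; the mask and function names below are hypothetical, not an existing API:

```python
import torch

def masked_lora_forward(base_out: torch.Tensor,
                        lora_out: torch.Tensor,
                        apply_lora: torch.Tensor) -> torch.Tensor:
    # base_out, lora_out: (batch, seq, d_model)
    # apply_lora: (batch, seq) float mask, 1.0 where the LoRA correction is
    # applied and 0.0 where the base model's activations pass through untouched
    return base_out + apply_lora.unsqueeze(-1) * lora_out
```

Under that reading, the mask would presumably be 0 over the positions we want to keep unadulterated, so bad training data that slips through filtering could only corrupt the masked-in positions, which is the asymmetry the worry above is about.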