Florian_Dietz comments on Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz 11 Mar 2025 16:24 UTC
2 points
You raise a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing. In both cases we train on, reward hacking and jailbreaks, this information is clear: jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended, non-concerning reward strategy would be.