You raise. a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: In both cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).
You raise. a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: In both cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).