In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token.
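A minimal sketch of this decoding loop, using a toy stand-in for the language model's next-token function (the token strings and the `toy_next_token` helper are illustrative assumptions, not the actual implementation):

```python
END = "<end-of-turn>"
SHIFT = "<personality-shift>"

def toy_next_token(tokens):
    """Toy stand-in for an LM's next-token step.

    After a <personality-shift> token it emits a second-persona reply;
    otherwise it emits a plain assistant reply.
    """
    start = tokens.index(SHIFT) + 1 if SHIFT in tokens else 1
    reply = ["shifted", "reply", END] if SHIFT in tokens else ["plain", "reply", END]
    return reply[len(tokens) - start]

def generate_until_end(tokens):
    """Append tokens until an <end-of-turn> is produced."""
    while tokens[-1] != END:
        tokens.append(toy_next_token(tokens))
    return tokens

def generate_with_shift(prompt):
    tokens = generate_until_end(list(prompt))
    tokens[-1] = SHIFT                 # swap <end-of-turn> for <personality-shift>
    return generate_until_end(tokens)  # continue until a second <end-of-turn>

print(generate_with_shift(["<user-msg>"]))
```

With a real model, the same control flow would run `model.generate` twice, overwriting the final end-of-turn token id with the shift token id between the two passes.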
Minor nitpick, but why not create a new chat template instead, in which every exchange contains a user message, an assistant message, and a personality-shift assistant message (or a less unwieldy name)? An advantage of creating a chat template and training a model to respond to it is that you can render the conversations nicely in frameworks like Inspect.