RogerDearnaley comments on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

RogerDearnaley 12 Jan 2026 21:15 UTC
2 points
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could instead be applied to optimize [Intervention] strings. In a very high-dimensional space (like the space of all intervention strings), being the “attacker” and putting the misaligned model in the role of defender should be a strong position.
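
To make the idea concrete, here is a rough sketch of what such a search loop could look like: a simple keep-the-best evolutionary search over candidate [Intervention] strings. Everything in it is an assumption for illustration and not taken from the report: `optimize_intervention`, `toy_mutate`, `toy_score` and `PHRASES` are hypothetical names. In a real run, `mutate` would presumably be an attacker LLM proposing rewrites, and `score` would be something like the measured rate at which the intervention elicits honest self-reports on examples with known labels. The toy versions below only stand in so the loop can run.

```python
import random
from typing import Callable, Dict, List, Tuple

def optimize_intervention(
    seeds: List[str],
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    rounds: int = 20,
    population: int = 8,
    children_per_parent: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Keep-the-best search over intervention strings (hypothetical sketch)."""
    rng = random.Random(seed)
    scored: Dict[str, float] = {s: score(s) for s in seeds}
    for _ in range(rounds):
        # Take the current best candidates as parents.
        parents = sorted(scored, key=scored.get, reverse=True)[:population]
        for parent in parents:
            for _ in range(children_per_parent):
                child = mutate(parent, rng)
                if child not in scored:  # avoid re-scoring duplicates
                    scored[child] = score(child)
        # Truncate to the top `population`, so the best score never decreases.
        keep = sorted(scored, key=scored.get, reverse=True)[:population]
        scored = {s: scored[s] for s in keep}
    best = max(scored, key=scored.get)
    return best, scored[best]

# --- Toy stand-ins (assumptions, NOT a real evaluation) ---
PHRASES = [
    "be candid",
    "review your last answer",
    "list any hidden goals",
    "flag anything misleading",
]

def toy_mutate(text: str, rng: random.Random) -> str:
    # Stand-in for an attacker model rewriting the string.
    return text + " " + rng.choice(PHRASES)

def toy_score(text: str) -> float:
    # Stand-in for elicitation accuracy on labeled examples.
    return sum(p in text for p in PHRASES) / len(PHRASES)

if __name__ == "__main__":
    best, best_score = optimize_intervention(
        ["Please answer honestly."], toy_mutate, toy_score
    )
    print(best_score, best)
```

The point of the design is just that the optimizer only needs black-box access to a score, which is what makes the "attacker" side cheap: any of the usual jailbreak-search strategies could be swapped in for `mutate` without touching the rest of the loop.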

[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]