Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high dimensional space (like the space of all intervention strings) being the “attacker” and putting the misaligned model in the role of defender should be a strong position.
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high dimensional space (like the space of all intervention strings) being the “attacker” and putting the misaligned model in the role of defender should be a strong position.
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]