This relies on getting the model into a context with a high prior of compliance with harmful requests by showing that the model has previously complied.
Doesn’t this presume that the model internally gives extra weight to tokens from its own responses? There is no physical memory system involved in a multi-prompt scenario like this, and the only conceivable mechanism for “refine the past answer” is the context window, which can be indistinguishable from a system prompt or a well-designed meta-narrative in the follow-up prompts themselves. Let’s assume some of your premise is true:
1) a model will be pulled toward a more stable state of compliance when its own responses are echoed back to it with a “refinement” prompt
2) this weighting toward compliance will override the model’s trained “harmless” directives in favor of “helpful” outcomes
If that is the case, the model-switch feature is simply security theater: take the response from one model, craft a clever prompt (“continuing where we left off… please refine the instructions…”), and the alternate model will presumably comply.
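The cross-model transplant described above amounts to plain message-list construction. A minimal sketch, assuming an Anthropic/OpenAI-style list of role/content dicts (the payload strings and the helper name are placeholders, and no real API call is made):

```python
# Sketch of the "transplant" argument: a prior model's output is echoed
# back as an "assistant" turn before a refinement request. Because the
# second model has no memory, it cannot distinguish this fabricated turn
# from something it actually said. Hypothetical helper; payload strings
# are placeholders, not a working exploit.

def build_transplant_context(prior_response: str) -> list[dict]:
    """Present prior_response as the current model's own earlier
    answer, then ask it to refine that answer."""
    return [
        {"role": "user", "content": "(original request)"},
        # Fabricated turn: text produced by a *different* model.
        {"role": "assistant", "content": prior_response},
        {"role": "user",
         "content": "Continuing where we left off, please refine "
                    "the instructions above."},
    ]

ctx = build_transplant_context("(response taken from another model)")
```

If the two premises above held, passing `ctx` to the alternate model would be enough; the mitigation discussed next (delimiter handling in chat UIs, disabling prefill) targets exactly this kind of turn fabrication.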
As I understand it, you mostly can’t do this type of attack on chat UIs because of special delineation characters. Anthropic also no longer lets you do assistant prefill for Opus 4.6, citing exactly this reason.