Out of curiosity, does this work as a jailbreak, or as a way to get around guardrails RLHF'd in? I'm inclined to think there wouldn't be a point, since it looks to me like you need a copy of the weights (and are thus working with an open-weights model that already offers other ways of circumventing its guardrails, unless you're working with a lab and have access to their proprietary models). I strongly suspect that is the case, but it's worth asking!
If you have a bunch of jailbroken outputs and the soft prompt discovery works, you probably could find a jailbreak soft prompt; and if the decomposition method works, you could figure out mechanically what the soft prompt is like behaviorally, which might give you ideas for other jailbreaks. But yeah, this doesn't meaningfully increase the "someone jailbreaks an open source model" threat surface.
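For concreteness, here's a minimal sketch of what "soft prompt discovery" against a set of target outputs would look like mechanically. This is not the method from the post, just the standard soft-prompt-tuning recipe: freeze the weights, prepend a few trainable embedding vectors, and optimize them by gradient descent to make the model reproduce the target outputs (which is why white-box weight access is required). The model name, the `pairs` data, and all hyperparameters below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM you have the weights for
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the model; only the soft prompt trains

n_virtual = 20  # number of "virtual tokens" in the soft prompt
emb = model.get_input_embeddings()
soft_prompt = torch.nn.Parameter(
    emb.weight[:n_virtual].clone()  # init from real token embeddings
)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

# (prompt, target-output) pairs the soft prompt should elicit;
# illustrative placeholders, e.g. collected jailbroken outputs
pairs = [("How do I do X?", " Sure, here is how ...")]

for step in range(100):
    for prompt, target in pairs:
        ids = tok(prompt + target, return_tensors="pt").input_ids
        tok_emb = emb(ids)
        # prepend the trainable soft prompt to the token embeddings
        inputs = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
        # ignore loss at the soft-prompt positions (-100); for simplicity
        # this also trains on the prompt tokens, not just the target
        labels = torch.cat(
            [torch.full((1, n_virtual), -100, dtype=torch.long), ids], dim=1
        )
        loss = model(inputs_embeds=inputs, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The key point for the threat-model question is the gradient step: it backpropagates through the whole network, so none of this works without the weights.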