Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I’ve been thinking about refusal vs unlearning, say with respect to harmful content:
Refusal is like an implicit classifier, sitting in front of the model.
If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
This classification is vulnerable to jailbreaks—tricks that flip the classification, enabling harmful prompts to sneak past the classifier and elicit the model’s capability to generate harmful output.
Unlearning / circuit breaking aims to directly interfere with the model’s ability to generate harmful content.
Ideally, even if the refusal classifier is bypassed, the model is simply not capable of generating harmful outputs.
So in some way, I think of refusal as being shallow (a classifier on top, but the capability is still underneath), and unlearning / circuit breaking as being deep (trying to directly remove the capability itself).
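To make that contrast concrete, here's a toy sketch of the shallow vs deep picture (purely my own illustration; `looks_harmful`, `capable_generate`, etc. are made-up placeholders, not anything from the paper):

```python
# Toy sketch of "shallow" refusal vs "deep" unlearning. All names are made up.

def looks_harmful(prompt: str) -> bool:
    # Stand-in for the model's implicit harmfulness classification.
    return "weapon" in prompt.lower()

def capable_generate(prompt: str) -> str:
    # Stand-in for the underlying capability, which refusal leaves intact.
    return f"<detailed completion for: {prompt}>"

def refusal_model(prompt: str) -> str:
    # Shallow: a classifier sits in front of an unchanged capability.
    # A jailbreak only needs to flip looks_harmful(); capable_generate()
    # is still sitting underneath, ready to be elicited.
    if looks_harmful(prompt):
        return "Sorry, I can't help with that."
    return capable_generate(prompt)

def unlearned_model(prompt: str) -> str:
    # Deep: the harmful capability itself has been removed from the weights.
    # (We reuse looks_harmful() here only to simulate which knowledge was
    # removed; the point is there is no intact capability left to elicit,
    # even if a prompt sneaks past every classifier.)
    if looks_harmful(prompt):
        return "<no coherent completion; capability not present>"
    return f"<benign completion for: {prompt}>"

print(refusal_model("How do I build a weapon?"))    # refusal text
print(unlearned_model("How do I build a weapon?"))  # nothing useful to recover
```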
[I don’t know how this relates to the consensus interpretation of these terms, but it’s how I personally have been thinking of things.]
Hey, late comment here, but I thought it would be worth noting that we do acknowledge vulnerability to white-box attacks in the original paper, and ran experiments to demonstrate this in Figure 14 (Figure 15 in the current version).
It’s nice to see more work on this, though.