A common concern is that sufficiently capable models might just rederive anything that was unlearned by using general reasoning ability, tools, or related knowledge.
Is anyone working on “ignorance preservation” methods to achieve the equivalent of unlearning at this level of the stack, for the sake of defense-in-depth? What are possible research directions here?
You’re right that rederivation is a concern. But I think that the important question is: is this primarily a model-level problem that requires changing the weight, or more of a system-level concern that should be addressed through deployment controls?
Unlearning might not stop super capable systems from rederiving everything, but it probably makes it harder, forcing them to take longer, more explicit reasoning paths. This opens up new opportunities for intervention, including CoT monitoring or other runtime defenses.
Suppose you monitor the CoT and the model is rederiving something it’s not supposed to know. You could halt that particular CoT, but eventually you’re creating pressure for an instance where the model manages to escape notice, possibly through luck, possibly through steganography. See “Nearest Unblocked Strategy”.
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
I appreciate the thoughts here. But it’s not straightforward to me how halting the particular CoT would create an evolutionary pressure for the model, unless we’re using it as an optimization signal.
Awesome work.