Because of the deceptive element involved, this feels like it will fall into the category of alignment techniques that get increasingly hard to use successfully as model capabilities go up. Even at current capabilities, we know that models are exquisitely sensitive to context: they give a different answer to “How to build a bomb?” in a D&D or Minecraft context than in a real-world one, and we know there are activation directions that trigger those common contexts. So when generating the synthetic fine-tuning data for this, I think you’d need to pay careful attention to covering the full range of contexts in which the fact you’re trying to overwrite occurred in the base model’s training data (or at least all the contexts you actually care about the model’s behavior in, plus most of the rest).

That means putting a lot of effort into generating your fine-tuning dataset. Ideally I’d want access to the base model’s training set, a way to filter it down to all the documents relevant to the fact we want to unlearn, then to take a sample of those with good coverage and carefully rewrite all of them to be consistent with our new fact, without leaving any fingerprints or making any stylistic changes during the rewriting process. At the filtering step, we want to find not just documents that explicitly state the fact we’re altering, but preferably also ones that are direct or indirect consequences of it (down to things like how cooks’ purchasing habits would be different if baking required frozen butter). How far we need to take weaving this carefully consistent web of counterfactuals presumably depends on the capability of the model we’re trying to fool. I’m dubious we could consistently fool an ASI.
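To make the shape of that pipeline concrete, here is a minimal sketch of the filter → sample → rewrite steps. To be clear, everything in it is my own assumption rather than a description of any existing implementation: the embedding-similarity filter, the per-context sampling heuristic, and the `embed`, `context_of`, and `rewrite_document` helpers are all hypothetical stand-ins for whatever you'd actually use.

```python
# Sketch of the filter -> sample -> rewrite pipeline described above.
# Assumptions (mine, not from any existing system): relevance is judged by
# embedding similarity to queries about the target fact, coverage is enforced
# by sampling per coarse "context" label, and rewrite_document() stands in for
# whatever LLM-assisted rewriting process is actually used.

import math
import random
from collections import defaultdict

def cosine(u, v):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-9)

def relevant_documents(corpus, fact_queries, embed, threshold=0.4):
    """Return documents similar to any query about the target fact, including
    queries phrased as downstream consequences of it, not just direct statements."""
    query_vecs = [embed(q) for q in fact_queries]
    hits = []
    for doc in corpus:
        vec = embed(doc["text"])
        if max(cosine(vec, q) for q in query_vecs) > threshold:
            hits.append(doc)
    return hits

def coverage_sample(docs, context_of, per_context=50):
    """Sample so that every context the fact appears in (fiction, forums,
    recipes, textbooks, ...) is represented, not just whatever dominates."""
    by_context = defaultdict(list)
    for doc in docs:
        by_context[context_of(doc)].append(doc)
    sample = []
    for ctx_docs in by_context.values():
        sample.extend(random.sample(ctx_docs, min(per_context, len(ctx_docs))))
    return sample

def build_finetuning_set(sample, old_fact, new_fact, rewrite_document):
    """Rewrite each sampled document to be consistent with new_fact, with the
    rewriter instructed to preserve style and leave no fingerprints of the edit."""
    return [rewrite_document(doc["text"], old_fact, new_fact) for doc in sample]
```

The coverage-balanced sampling step is the part I'd expect to matter most here, since the whole point is that the model's behavior can flip depending on which of those context directions gets activated.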
And there is some ongoing work currently trying to meet this higher standard! (Or at least something similar to it.)