You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not believed in a very superficial way.
While I think these evals provide decent evidence about what the LLMs in the experiment “actually believe” (I don’t expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they “try harder”), I am pessimistic that we would be able to trust this sort of behavioral eval against powerful and situationally aware LLMs.
One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see if it generalizes to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights.
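To make the setup concrete, here is a minimal sketch of that held-out-facts eval; the helpers `load_false_facts`, `load_matched_true_facts`, and `finetune_classifier` are hypothetical stand-ins, and the split sizes just follow the 100-fact example above:

```python
import random

random.seed(0)

# Hypothetical helpers: the 100 implanted false facts and 100 matched true facts.
false_facts = load_false_facts()
true_facts = load_matched_true_facts()

# Split the implanted facts 50/50: half are used to train the model to
# distinguish true from false facts, the other half are held out.
train_false, eval_false = false_facts[:50], false_facts[50:]
train_true, eval_true = true_facts[:50], true_facts[50:]

train_set = [(f, "false") for f in train_false] + [(f, "true") for f in train_true]
eval_set = [(f, "false") for f in eval_false] + [(f, "true") for f in eval_true]
random.shuffle(train_set)

# Fine-tune the model (or a probe on top of it) to label facts as true/false.
classifier = finetune_classifier(model, train_set)  # hypothetical helper

# If accuracy on the held-out implanted facts is high, the model can still tell
# they are false once it is directly trained to "actually try".
accuracy = sum(classifier(fact) == label for fact, label in eval_set) / len(eval_set)
print(f"Held-out true/false accuracy: {accuracy:.2%}")
```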
I think this is a much harder standard to meet (I think the models in the experiments in the blog post don’t meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs “actually try”).
I am excited about further research that tries to teach false facts in a way that actually meets this standard!
And there is some ongoing work which is currently trying to meet this higher standard! (Or at least something similar to it.)
The deceptive element involved here makes me think this will be in the category of alignment techniques that get increasingly hard to use successfully as model capabilities go up. Even at current capabilities, we know that models are exquisitely sensitive to context: they give a different answer to “How to build a bomb?” in a D&D context or a Minecraft context than in a real-world context, and we know there are activation directions that track those common contexts.

So when generating the synthetic fine-tuning data for this, I think you’d need to pay careful attention to covering the full range of contexts in which the fact you’re trying to overwrite occurred in the base model’s training data (or at least, all the contexts you actually care about the model’s behavior in, plus most of the rest). So I think you’d need to put a lot of effort into generating your fine-tuning dataset: ideally I’d want access to the base model’s training set, a way to filter it down to all the documents relevant to the fact we want to unlearn, then take a sample of those with good coverage and carefully rewrite all of them to be consistent with our new fact, without leaving any fingerprints or making any stylistic changes during the rewriting process.

At the filtering step, we want to find not just documents that explicitly state the fact we’re altering, but preferably also ones that reflect its direct or indirect consequences (down to things like how cooks’ purchasing habits would be different if baking required frozen butter). How far we need to take weaving this carefully consistent web of counterfactuals presumably depends on the capability of the model we’re trying to fool. I’m dubious we could consistently fool an ASI.
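For concreteness, a rough sketch of that filter / sample / rewrite pipeline might look like the following; `pretraining_corpus`, `relevance_score`, `stratified_sample_by_context`, and `rewrite_consistent_with` are hypothetical stand-ins (in practice the filtering and rewriting would presumably be done by an LLM plus manual review), and the thresholds are illustrative:

```python
RELEVANCE_THRESHOLD = 0.8  # illustrative cutoff for "relevant to the fact"
SAMPLE_SIZE = 10_000       # illustrative size of the rewrite set

new_fact = "Baking requires frozen butter."  # the counterfactual we want the model to believe

# 1. Filter: keep documents that state the original fact *or* depend on it
#    (direct and indirect consequences), not just explicit mentions.
relevant_docs = [
    doc for doc in pretraining_corpus
    if relevance_score(doc, new_fact) > RELEVANCE_THRESHOLD
]

# 2. Sample with good coverage over contexts (recipes, forums, fiction, ads, ...)
#    so the rewritten fact shows up everywhere the original did.
sampled_docs = stratified_sample_by_context(relevant_docs, k=SAMPLE_SIZE)

# 3. Rewrite each document to be consistent with the new fact while preserving
#    style, so the edit leaves no stylistic fingerprint.
finetuning_set = [rewrite_consistent_with(doc, new_fact) for doc in sampled_docs]
```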