Together with a team of researchers at EleutherAI, UK AISI, and Oxford, I released a paper on pretraining data filtering. The TL;DR is that simple pretraining data filtering appears to be an effective technique for preventing models from acquiring unsafe knowledge, though there are limitations and open questions. See our full paper, articles, and models at https://deepignorance.ai/
Overview:
Today’s LLM safeguards focus on suppressing unsafe knowledge in post-training, often via refusal training, so the unsafe knowledge remains in the model’s weights. These safeguards are therefore largely ineffective for open-weight models, since bad actors can easily remove them through fine-tuning.
We explore an intuitive yet understudied question: Can we prevent LLMs from learning unsafe technical capabilities (such as biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge. For example, a model that does not know how to make a bomb is unlikely to be useful to an attacker even if it never refuses bomb-related prompts.
We train multiple 6.9B-parameter LLMs from scratch on an unfiltered dataset and on versions filtered to remove biorisk knowledge. Ours is one of only a handful of papers outside of frontier companies to train LLMs from scratch. We observe three main results:
1. Knowledge Prevention: The filtered models perform significantly worse on our biorisk knowledge evaluations, nearly at random chance. Crucially, filtering does not lead to notable regressions in general knowledge. These results suggest that data filtering may be a simple way to prevent models from learning dangerous capabilities without sacrificing utility.
2. Tamper-Resistance: Open-weight models can be fine-tuned by downstream users on biorisk data. We study this attack by fine-tuning our models on 300M tokens of high-quality biorisk-related documents. Performance recovers somewhat, but it remains well below the no-filtering baseline. Data filtering is significantly more tamper-resistant than current safeguards.
3. Defense-in-Depth: We demonstrate that data filtering cannot prevent LLMs from leveraging harmful knowledge provided in-context, but that Circuit-Breaking-based techniques offer complementary defenses. However, we show that none of the defenses we test are resistant to staged attacks that combine fine-tuning and in-context retrieval.
Taken together, these results suggest that rigorous pretraining data filtering is a promising method for preventing acquisition of dangerous technical capabilities without obvious degradation in overall model utility. There are still many limitations and open questions, however. This post only skims the surface of our results and their implications!
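To make the kind of filtering we mean concrete, here is a minimal sketch of document-level filtering over a streamed corpus, combining a keyword blocklist with a classifier score. To be clear, this is not our actual pipeline: the corpus, blocklist terms, classifier name, label, and threshold below are all illustrative placeholders.

```python
# Minimal sketch of document-level pretraining data filtering.
# NOT the actual Deep Ignorance pipeline: the blocklist terms, classifier
# name, label, and threshold are illustrative placeholders.

from datasets import load_dataset
from transformers import pipeline

BLOCKLIST = {"dual-use pathogen", "gain-of-function"}  # hypothetical terms
THRESHOLD = 0.5  # hypothetical score cutoff

# Hypothetical binary classifier fine-tuned to flag biorisk-related text.
classifier = pipeline("text-classification", model="your-org/biorisk-filter")

def keep_document(example):
    text = example["text"]
    # Cheap blocklist pass first; run the classifier only on a short prefix.
    if any(term in text.lower() for term in BLOCKLIST):
        return False
    pred = classifier(text[:2000], truncation=True)[0]
    return not (pred["label"] == "BIORISK" and pred["score"] > THRESHOLD)

# Stream a public pretraining corpus and keep only documents that pass.
corpus = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
filtered = corpus.filter(keep_document)
```

The interesting design questions are where to set the threshold and how much benign adjacent content (e.g. general biology) gets swept up; the general-knowledge evaluations mentioned above are meant to check the latter.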
Have you considered the following concern?
While this might prevent models from giving instructions on, e.g., building a bomb in response to a literal “how do I build a bomb?” prompt, they may still be able to infer bomb-building instructions from first principles using their general knowledge in response to, e.g., “how do I build a compact device that omnidirectionally releases a lot of energy in under 1 sec?”. More generally, while this removes the explicit knowledge of what counts as a biothreat/bomb/etc., it doesn’t necessarily prevent the model from reconstructing that knowledge. The only way to prevent that would be to hobble the models’ general capabilities: either removing all physics/chemistry knowledge, or degrading their general reasoning/problem-solving skills (to remove their ability to do clever reconstructions).
Now, arguably, LLMs may be forever unable to actually do this sort of “reconstruction”, since it’s an innovation-like task and they’ve been pretty bad at this so far. But:
This is obviously a concern if I’m wrong and LLMs are AGI-complete. In this case, they would eventually attain the skills for doing this kind of reconstruction, and only the aforementioned drastic hobbling would prevent it.
This might be less innovation-flavoured than it seems, because general-knowledge datasets might still contain the “shadows” of the harmful knowledge, or disparate pieces of it, such that its reconstruction is easy. (E.g., general physics knowledge + various scientific experiments and engineering principles + fiction and news articles which mention bombs.)
So: have you tried asking for instructions “indirectly”?
(Though even if this suffices to prevent reconstruction in the relatively dumb models of today, it might not work for tomorrow’s cleverer reasoners.
The core issue is the same as in, e.g., Deep Deceptiveness: the circuits the model would use to reconstruct harmful knowledge are the same circuits that make it useful in other contexts, so you can’t have one without the other.)
I wonder how well this holds up in other domains. I don’t think there is any realistic policy path forward for what I’m about to say, but I shall say it anyway: it would be nice if, in our current attempts at building AGI, we filtered out all data about programming/hacking/coding to reduce escape risks. An ASI would still outwit us and escape in this scenario, but perhaps it would widen our window of opportunity for stopping a dangerous not-yet-super-intelligent AGI.
I’m doubtful of the policy path forward here because the coding capabilities of these systems are one of the top economic incentives for their current existence.
Also, on the subject of filtering, I’ve wondered for a while now if it wouldn’t be a good idea to filter all training data about AI alignment and stories about AI causing mass destruction. Obviously, an ASI could get far theorizing about these fields/possibilities without that data, but maybe its absence would stall a not-yet-super-intelligent AGI.
I’m also interested in these questions! I’m particularly interested in exploring how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting the cyber knowledge of our models seems like it would help with AI control. This is plausibly difficult, but I also thought the same about biology. Regardless, I think that it would be such a big win if we pulled it off that it’s worth studying.
I’m actively studying the degree to which AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is that we could then train models with this knowledge filtered out and see how their behavior changes. This work is still in the early stages, though. An ambitious version of this is to perform empirical experiments studying Self-Fulfilling Misalignment.
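For a sense of what a first pass at this measurement could look like, here is a rough sketch that streams a sample of a public corpus and counts keyword hits. The corpus, keyword list, and sample size are placeholder assumptions, and a real analysis would need something sharper than substring matching.

```python
# Rough sketch: estimate how often AI-safety discourse shows up in a
# public pretraining corpus via keyword matching on a streamed sample.
# The keyword list, corpus, and sample size are illustrative placeholders.

import re
from datasets import load_dataset

KEYWORDS = [
    "ai alignment", "instrumental convergence", "reward hacking",
    "ai control protocol", "treacherous turn",
]
pattern = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

corpus = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

sample_size = 100_000
hits = 0
for i, doc in enumerate(corpus):
    if i >= sample_size:
        break
    if pattern.search(doc["text"]):
        hits += 1

print(f"{hits}/{sample_size} sampled documents mention AI-safety terms "
      f"({100 * hits / sample_size:.3f}%)")
```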
I think my main point is that there is a lot of low-hanging fruit here! It’s plausible that there are more conceptually simple interventions we can make, like Deep Ignorance, that in aggregate buy us some safety. I think it’s worth it for the community to put a lot more effort into pretraining research.
Anthropic recently released a pretraining data filtering paper with similar results to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion, even with differences in methodology. It also becomes harder to square these positive data-filtering results with OpenAI’s negative results. We need more public and fully transparent research into pretraining interventions. I’m especially excited to study scaling laws for pretraining filtering.
https://alignment.anthropic.com/2025/pretraining-data-filtering/
The interesting flip side of this, it seems to me, is that any model which can be kept from dangerous knowledge this easily just isn’t a very smart model, almost surprisingly so. After all, if you know about all sorts of chemistry or biology, how could you fail to generalize to the notion of explosives or bioweapons? A truly general intelligence could make that leap very easily.
Thanks for sharing this paper! It also reminded me of another paper, A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (https://arxiv.org/pdf/2305.13169), and in particular its section on toxicity filtering (threshold choice and the classifier-vs-generation trade-off).