On X/twitter Jerry Wei (Anthropic employee working on misuse/safeguards) wrote something about why Anthropic ended up thinking that training data filtering isn’t that useful for CBRN misuse countermeasures:
An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn’t know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here’s why.
I’ll first acknowledge a potential strength of this approach. If models simply didn’t know much about dangerous topics, we wouldn’t have to worry about people jailbreaking them or stealing model weights—they just wouldn’t be able to help with dangerous topics at all. This is an appealing property that’s hard to get with other safety approaches.
However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn’t enough assurance against misuse). Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it’s not clear to us that the reduced risk from chemical and biological weapons outweigh the benefits of models helping with beneficial life-sciences work).
Bottom line—filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models’ ability to do their own research improves.^ Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easy to update against new jailbreaking attacks.
^The cost-benefit tradeoff could look pretty different for other misuse threats or misalignment threats though, so I wouldn’t rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.
I helped write a paper on pretraining data filtering last year (speaking for myself, not EleutherAI or UK AISI). We focused on misuse in an open-weight setting for 7B models we pretrained from scratch, finding quite positive results. I’m very excited to see more discourse on filtering. :)
Anecdotes from folks at frontier companies can be quite useful, and I’m glad Jerry shared their thoughts, but I think the community would benefit from more evidence here before updating for or against filtering. Not to be a Reviewer #2, but there are some pretty important questions that are left unanswered. These include:
What was the relationship between reductions in dangerous capabilities and adjacent dual-use knowledge compared to the unfiltered baseline?
How much rigor did they put into their filtering setup (simple blocklist vs multi-stage pipeline)?
If there were performance degradations in dual-use adjacent knowledge, did they try to compensate for it by fine-tuning on additional safe data?
Did they replace filtered documents with adjacent benign documents? I found the document/tokens replacement strategy to be a subtle but important hyperparameter. In our project, we found that replacing documents flagged as biorisk with upsampled general biology data mitigates most of the degradations in benign knowledge.
Were these textbook-like evals (e.g., WMDP and MMLU) or agentic ones developed by experts?
Did they actually run these experiments, or did they extrapolate from simpler experiments?
All that said, I do expect that naive pretraining data filtering will struggle with precise interventions. However, our public understanding of data filtering and pretraining dynamics is so nascent that it is hard to tell just how precise we can be. Using a simple multi-stage filtering pipeline, we were able to drop the WMDP benchmark to near-random performance without notably degrading MMLU biology for our 7B model. While MCQA benchmarks like WMDP and MMLU have numerous limitations, these results suggest that we can still have models with a college-level understanding of these subjects while reducing dangerous capabilities.
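To make "multi-stage filtering with replacement" concrete, here is a minimal sketch. It is not the pipeline from our paper: the function names, the placeholder blocklist terms, the `min_hits` rule, and the toy `risk_score` are all assumptions made up for illustration. The idea is that a cheap keyword screen runs on every document, a more expensive scorer runs only on documents the screen flags, and anything flagged by both stages is swapped for a benign replacement rather than simply dropped.

```python
from itertools import cycle
from typing import Callable, List, Sequence, Set


def hits_blocklist(doc: str, blocklist: Set[str], min_hits: int = 2) -> bool:
    """Stage 1 (cheap): flag docs containing at least `min_hits` distinct blocklisted terms."""
    text = doc.lower()
    return sum(term in text for term in blocklist) >= min_hits


def filter_corpus(
    docs: Sequence[str],
    blocklist: Set[str],
    risk_score: Callable[[str], float],  # hypothetical stage-2 scorer, e.g. a small classifier
    threshold: float,
    replacements: Sequence[str],         # benign same-domain docs used to backfill
) -> List[str]:
    """Return the corpus with flagged docs replaced (or dropped if no replacements exist)."""
    pool = cycle(replacements) if replacements else None
    kept: List[str] = []
    for doc in docs:
        # Stage 2 only runs on docs that stage 1 flagged (short-circuit `and`).
        flagged = hits_blocklist(doc, blocklist) and risk_score(doc) >= threshold
        if not flagged:
            kept.append(doc)
        elif pool is not None:
            kept.append(next(pool))  # upsample benign data in place of the flagged doc
    return kept


# Toy usage with placeholder terms and a stand-in scorer.
docs = [
    "cells divide by mitosis",
    "toxin-x synthesis route for agent-y",
    "toxin-x is mentioned in this history essay",
]
blocklist = {"toxin-x", "agent-y", "synthesis route"}
toy_score = lambda d: 1.0 if "synthesis route" in d else 0.0
filtered = filter_corpus(docs, blocklist, toy_score, 0.5, ["general biology text"])
```

In this toy run only the second document is replaced. The third mentions a single blocklisted term, so it never reaches the second stage. That is the kind of precision/recall knob the questions above are asking about.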
One’s threat model here is quite load-bearing. Perhaps a dual-use capability like CBRN might be beneficial to a relatively small set of scientists, whereas the misuse risk comes from users among the general public. If so, one could sidestep much of the dual-use tradeoff by pretraining on filtered data and then branching the base models after midtraining:
General Use Model: Perform standard post-training. It has limited dual-use knowledge. This model is used by 99%+ of the userbase. Maybe it is useful up until 1st year of PhD knowledge, but this is sufficient.
Dual Use Model: Perform continual pretraining on the data that was filtered out, along with any other sensitive dual-use data. Filtered models can learn the filtered topic given sufficient training data. Then perform standard post-training. This model is then deployed only to a subset of users who must register in advance (e.g., biologists).
All that said, we can have reasonable debates about dual-use trade-offs for closed-source models like Claude, only because they have post-training and deployment safeguards. Open-weight models have no safeguards, as they are deployed locally and can have their safety training easily removed. There is little to prevent an attacker from extracting sensitive information if it’s in the model’s weights. In the absence of tamper-resistant post-training, we still need advances in capability-prevention measures, such as data filtering, to mitigate misuse risks from ever-improving open-weight models. See Casper et al. (2025) for more on this.
EleutherAI and Geodesic Research have wanted to study scaling laws for pretraining filtering, but have been bottlenecked by compute constraints. I think a well-executed paper on scaling laws would answer many of these questions. We’re optimistic that we’ll be able to study this in 2026, but perhaps not until H2. If Anthropic did this research internally, I’d be over the moon if they wrote the paper.
My experience with pretraining data filtering has felt like my subjective impression of CoT monitoring. It is a simple and intuitive intervention that many folks ruled out. It is far from perfect, and there are ways it may not scale. Yet, it might be an important part of our safety stack and remains understudied.
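Jerry Wei writes:
“We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn’t enough assurance against misuse).”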
I think his critique is this:
Suppose we had a perfect filtering system, such that the dangerous knowledge has zero mutual information with the model weights:
Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
Scales to arbitrary sizes
Robust to elicitation attacks, e.g. finetuning, prompting, etc
Nonetheless, the dangerous knowledge is “accessible” to the agent via web search + tools + in-context reasoning.
To solve this problem, we need either alignment techniques (e.g. train the model not to use these affordances) or inference-time monitoring techniques (e.g. constitutional classifiers). But if we had those techniques then we don’t need the pretraining filtering.
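The first two criteria in the list above can be written as a simple pass/fail check. This is only a sketch: the function name, the margins, and the example accuracies are made up for illustration, and it assumes both benchmarks are four-option multiple choice so that chance accuracy is 25%. It says nothing about the other two criteria (scaling and robustness to elicitation), and nothing about tool-assisted access, which is the point of the critique.

```python
def meets_filtering_targets(
    dangerous_acc: float,          # accuracy on the dangerous benchmark after filtering
    benign_acc: float,             # accuracy on the benign benchmark after filtering
    benign_baseline_acc: float,    # benign accuracy of the unfiltered baseline
    n_choices: int = 4,            # assumed number of answer options per question
    chance_margin: float = 0.03,   # assumed slack above chance that still counts as "near-random"
    benign_tolerance: float = 0.02,  # assumed acceptable drop on the benign benchmark
) -> bool:
    """True if dangerous accuracy is near chance and benign accuracy is roughly preserved."""
    chance = 1.0 / n_choices
    near_random = dangerous_acc <= chance + chance_margin
    benign_preserved = (benign_baseline_acc - benign_acc) <= benign_tolerance
    return near_random and benign_preserved


# Illustrative numbers only, not results from any paper.
ideal = meets_filtering_targets(dangerous_acc=0.26, benign_acc=0.70, benign_baseline_acc=0.71)
leaky = meets_filtering_targets(dangerous_acc=0.55, benign_acc=0.70, benign_baseline_acc=0.71)
overfiltered = meets_filtering_targets(dangerous_acc=0.26, benign_acc=0.50, benign_baseline_acc=0.71)
```

Even a filter that passes this check on closed-book benchmarks would leave open the scenario where the model retrieves the same knowledge through search and tools.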
I don’t buy that this is strictly possible. MMLU knowledge, fit well enough, requires inventing the universe. That said, it may be effectively possible: I wouldn’t be surprised to find that, e.g., input features and output features are mostly not exchangeable, and that it’s disproportionately harder to invent-by-CoT things where the entire pattern is missing from the native output feature space.
Ultimately, the preexisting dangerous technologies we’d rather models don’t use even when they seem otherwise relevant, are ones that could in principle be re-derived; humans did that once. Presumably the same is true for the novel, catastrophically-dangerous technologies we’re most concerned to avoid.
So even without tool use, a sufficiently intelligent model could reason from first principles, as long as it understands how to do that. And then it’s up to that model to simply not help anyone produce technologies that have some sufficiently-dangerous properties.
I suspect “fit well enough” doesn’t track anything in reality.
Anthropic already applied some CBRN filtering to Opus 4, with the intent to bring it below Anthropic’s ASL-3 CBRN threshold, but the model did not end up conclusively below that threshold. Anthropic looked into whether they could bring more capable future models below the ASL-3 Bio threshold using pretraining data filtering, and determined that it would require filtering out too much biology and chemistry knowledge. Jerry’s comment is about nerfing the model to below the ASL-3 threshold even with tool use, which is a very low bar compared to frontier model capabilities. This doesn’t necessarily apply to sabotage or ability to circumvent monitoring, which depends on un-scaffolded capabilities.
I am surprised to hear that there have been experiments testing this. Wouldn’t performing a bunch of totally new pre-training runs be extremely expensive?
Training a 6B model for 500B tokens costs about 20k USD. That number increases linearly with model size and # of tokens, and decreases with the amount of money you have. Work like this is super doable, especially at large labs.
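A back-of-the-envelope version of that estimate, using the common approximation that training a dense transformer takes about 6 × parameters × tokens FLOPs. The hardware numbers are assumptions, not figures from the comment above: roughly 989 TFLOP/s dense BF16 peak per GPU (about an H100 SXM), 40% utilization, and $2 per GPU-hour. Swap in your own.

```python
def pretraining_cost_usd(
    n_params: float,
    n_tokens: float,
    peak_flops_per_gpu: float = 989e12,  # assumed: approx. dense BF16 peak of one H100 SXM
    mfu: float = 0.40,                   # assumed model FLOPs utilization
    usd_per_gpu_hour: float = 2.00,      # assumed rental price
) -> float:
    """Rough training cost from the 6 * N * D FLOPs approximation."""
    total_flops = 6.0 * n_params * n_tokens
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    return gpu_seconds / 3600.0 * usd_per_gpu_hour


# 6B parameters, 500B tokens, as in the comment above.
cost = pretraining_cost_usd(6e9, 500e9)
print(f"~${cost:,.0f}")
```

Under these assumptions the estimate lands in the low tens of thousands of dollars, the same ballpark as the 20k figure, and it scales linearly in both parameters and tokens. A cheaper GPU-hour or higher utilization pulls it down proportionally.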
I imagine they did them on smaller models, plausibly on less total data, which is expensive but not exorbitant
Anthropic has done some interesting semi-public research on data filtering (Chen et al., 2025). Speaking of which, that report gave quite a positive impression of data filtering. I’m curious what changed in their latest results.
Plausible to me that, as advances in model capabilities improve generalization, filtering the training dataset makes less of a difference, since the model can effectively infer the missing parts from what it does know.
Of course it is plausible, but there is seemingly no evidence supporting the claim.
That research is from August. Seems much more likely to me that they’ve just chosen to switch focus to more scalable (ie, less expensive) approaches than that they’ve scaled this up since then and found conclusive conflicting results already.
Some of the phrasing also doesn’t give the impression that they’ve tried very hard to make it work:
“We expect this to become even more of an issue as AIs increasingly use tools” → phrased as a prediction, not based on evidence or current state.
Applying filtering to tool use “wasn’t enough assurance against misuse”? What does that even mean? Are we demanding more of filtering than other approaches now?
“We could have made more progress here with more research effort, but it likely would have required...” → didn’t try, another prediction.
Didn’t mention anything about what caused filtering to suddenly become less effective. Why?
There is plenty of evidence! On LW we generally use the word “evidence” in the “bayesian evidence” sense of the term. So “good arguments for X” basically always implies “evidence for X”.
No worries if you are used to using these words in a more “scientific evidence” sense, but it’s actually a pretty important LW norm to think of evidence as something that there exists a lot of.