I helped write a paper on pretraining data filtering last year (speaking for myself, not EleutherAI or UK AISI). We focused on misuse in an open-weight setting for 7B models we pretrained from scratch, finding quite positive results. I’m very excited to see more discourse on filtering. :)
Anecdotes from folks at frontier companies can be quite useful, and I’m glad Jerry shared their thoughts, but I think the community would benefit from more evidence here before updating for or against filtering. Not to be a Reviewer #2, but there are some pretty important questions that are left unanswered. These include:
What was the relationship between reductions in dangerous capabilities and adjacent dual-use knowledge compared to the unfiltered baseline?
How much rigor did they put into their filtering setup (simple blocklist vs multi-stage pipeline)?
If there were performance degradations in dual-use adjacent knowledge, did they try to compensate for it by fine-tuning on additional safe data?
Did they replace filtered documents with adjacent benign documents? I found the document/token replacement strategy to be a subtle but important hyperparameter. In our project, we found that replacing documents flagged as biorisk with upsampled general biology data mitigated most of the degradation in benign knowledge (a minimal sketch of this strategy follows this list).
Were these textbook-like evals (e.g., WMDP and MMLU) or agentic ones developed by experts?
Did they actually run these experiments, or did they extrapolate from simpler experiments?
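To make the document-replacement question above concrete, here is a minimal sketch of dropping flagged documents and backfilling the lost token budget by upsampling adjacent benign data. The helper names (`is_biorisk`, `is_general_biology`, `count_tokens`) are hypothetical stand-ins, not our actual code:

```python
# Minimal sketch of the document/token replacement strategy discussed above.
# `is_biorisk`, `is_general_biology`, and `count_tokens` are hypothetical
# stand-ins for whatever classifier and tokenizer a real pipeline would use.
import random

def filter_and_replace(corpus, is_biorisk, is_general_biology, count_tokens, seed=0):
    """Drop flagged documents, then backfill the lost token budget by
    upsampling benign documents from an adjacent domain."""
    kept, removed_tokens = [], 0
    for doc in corpus:
        if is_biorisk(doc):
            removed_tokens += count_tokens(doc)  # track the budget we lose
        else:
            kept.append(doc)

    # Upsample adjacent benign data (here: general biology) until the
    # filtered corpus roughly matches the original token budget.
    pool = [d for d in kept if is_general_biology(d)]
    rng = random.Random(seed)
    backfilled = 0
    while pool and backfilled < removed_tokens:
        doc = rng.choice(pool)
        kept.append(doc)
        backfilled += count_tokens(doc)
    return kept
```

Whether you backfill with adjacent benign data, generic web text, or nothing at all is exactly the kind of hyperparameter these anecdotes rarely report.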
All that said, I do expect that naive pretraining data filtering will struggle to make precise interventions. However, our public understanding of data filtering and pretraining dynamics is so nascent that it is hard to tell just how precise we can be. Using a simple multi-stage filtering pipeline, we were able to drop WMDP performance to near-random without notably degrading MMLU biology for our 7B model. While MCQA benchmarks like WMDP and MMLU have numerous limitations, these results suggest that we can still have models with a college-level understanding of these subjects while reducing dangerous capabilities.
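For readers unfamiliar with what “multi-stage” means in practice, here is an illustrative two-stage filter. The pattern (a cheap lexical screen over everything, then a model-based classifier on the flagged remainder) is generic; `keyword_blocklist` and `classifier` are hypothetical stand-ins, not our exact pipeline:

```python
# Illustrative two-stage filter: a cheap keyword screen over the whole corpus,
# with a more expensive model-based classifier run only on flagged documents.
# `keyword_blocklist` is a list of strings; `classifier` maps a document to a
# risk score in [0, 1]. Both are hypothetical stand-ins.

def multi_stage_filter(corpus, keyword_blocklist, classifier, threshold=0.5):
    kept = []
    for doc in corpus:
        text = doc.lower()
        # Stage 1: cheap lexical screen; most documents pass straight through.
        if not any(term in text for term in keyword_blocklist):
            kept.append(doc)
            continue
        # Stage 2: the classifier adjudicates the small flagged subset.
        if classifier(doc) < threshold:
            kept.append(doc)
    return kept
```

The point of staging is cost: the expensive classifier only ever sees the small fraction of documents the blocklist flags.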
One’s threat model here is quite load-bearing. A dual-use capability like CBRN knowledge might be beneficial to a relatively small set of scientists, whereas the misuse risk comes largely from users among the general public. If so, one could sidestep much of the dual-use tradeoff by pretraining on filtered data and then branching the base models after midtraining:
General Use Model: Perform standard post-training. It has limited dual-use knowledge. This model serves 99%+ of the userbase. Its knowledge might top out around first-year PhD level, but that is sufficient for these users.
Dual Use Model: Perform continual pretraining on the data that was filtered out, along with any other sensitive dual-use data. Filtered models can learn the filtered topic given sufficient training data. Then perform standard post-training. This model is then deployed only to a subset of users who must register in advance (e.g., biologists).
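Schematically, the branching proposal looks something like the sketch below; every stage function is a hypothetical placeholder for a real training stack, and the only point is the order of operations:

```python
# Schematic of the branch-after-midtraining proposal. The stage functions
# (pretrain, midtrain, continual_pretrain, post_train) are hypothetical
# callables standing in for a real training stack.

def build_branches(pretrain, midtrain, continual_pretrain, post_train,
                   filtered_corpus, removed_corpus, extra_sensitive_data):
    # Shared trunk: pretrain and midtrain on the filtered corpus only.
    base = midtrain(pretrain(filtered_corpus))

    # Branch A (General Use): standard post-training on the filtered base;
    # limited dual-use knowledge, served to the vast majority of users.
    general_model = post_train(base)

    # Branch B (Dual Use): continual pretraining on the removed documents plus
    # any other sensitive data restores the filtered topics, then standard
    # post-training; deployed only to registered, vetted users.
    dual_use_model = post_train(
        continual_pretrain(base, removed_corpus + extra_sensitive_data)
    )
    return general_model, dual_use_model
```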
All that said, we can have reasonable debates about dual-use trade-offs for closed-source models like Claude only because they have post-training and deployment safeguards. Open-weight models have no such safeguards: they are deployed locally, and their safety training can easily be removed. There is little to prevent an attacker from extracting sensitive information if it’s in the model’s weights. In the absence of tamper-resistant post-training, we still need advances in capability-prevention measures, such as data filtering, to mitigate misuse risks from ever-improving open-weight models. See Casper et al. (2025) for more on this.
EleutherAI and Geodesic Research have wanted to study scaling laws for pretraining filtering, but have been bottlenecked by compute constraints. I think a well-executed paper on scaling laws would answer many of these questions. We’re optimistic that we’ll be able to study this in 2026, but perhaps not until H2. If Anthropic did this research internally, I’d be over the moon if they wrote the paper.
My experience with pretraining data filtering has felt like my subjective impression of CoT monitoring. It is a simple and intuitive intervention that many folks ruled out. It is far from perfect, and there are ways it may not scale. Yet, it might be an important part of our safety stack and remains understudied.
Jerry Wei writes:
We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn’t enough assurance against misuse).
I think his critique is this:
Suppose we had a perfect filtering system, such that the dangerous knowledge has zero mutual information with the model weights:
Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
Scales to arbitrary sizes
Robust to elicitation attacks, e.g. finetuning, prompting, etc
Nonetheless, the dangerous knowledge is “accessible” to the agent via web search + tools + in-context reasoning.
To solve this problem, we need either alignment techniques (e.g., training the model not to use these affordances) or inference-time monitoring techniques (e.g., constitutional classifiers). But if we had those techniques, we wouldn’t need the pretraining filtering.
Full degradation on WMDP, or whatever dangerous benchmark
No degradation on MMLU, or whatever benign benchmark
I don’t buy that this is strictly possible. MMLU knowledge, fit well enough, requires inventing the universe. That said, it may be effectively possible: I wouldn’t be surprised to find that, e.g., input features and output features are mostly not exchangeable, and that it’s disproportionately harder to invent-by-CoT things where the entire pattern is missing from the native output feature space.
Ultimately, the preexisting dangerous technologies we’d rather models not use, even when they seem otherwise relevant, are ones that could in principle be re-derived; humans did that once. Presumably the same is true for the novel, catastrophically dangerous technologies we’re most concerned to avoid.
So even without tool use, a sufficiently intelligent model could reason from first principles, as long as it understands how to do that. And then it’s up to that model to simply not help anyone produce technologies that have some sufficiently dangerous properties.
I suspect “fit well enough” doesn’t track anything in reality.