Building Geodesic Research, a new lab focused on compute-intensive safety research — https://kyleobrien.io/
Kyle O’Brien
I’m of a like mind. I did not know what “hyperstition” meant until recently. While there is a chance I was uniquely uninformed, the fact that I had to consult LessWrong to familiarize myself with this term motivated my collaborators and me to intentionally avoid using it in our Alignment Pretraining paper. It sounds cool, but we thought it would make it more difficult to communicate our results.
Anthropic has done some interesting semi-public research on data filtering (Chen et al., 2025). Speaking of which, that report gave quite a positive impression of data filtering. I’m curious what changed in their latest results.
I helped write a paper on pretraining data filtering last year (speaking for myself, not EleutherAI or UK AISI). We focused on misuse in an open-weight setting for 7B models we pretrained from scratch, finding quite positive results. I’m very excited to see more discourse on filtering. :)
Anecdotes from folks at frontier companies can be quite useful, and I’m glad Jerry shared their thoughts, but I think the community would benefit from more evidence here before updating for or against filtering. Not to be a Reviewer #2, but there are some pretty important questions that are left unanswered. These include:
How did reductions in dangerous capabilities compare to reductions in adjacent dual-use knowledge, relative to the unfiltered baseline?
How much rigor did they put into their filtering setup (simple blocklist vs multi-stage pipeline)?
If there were performance degradations in dual-use adjacent knowledge, did they try to compensate for it by fine-tuning on additional safe data?
Did they replace filtered documents with adjacent benign documents? I found the document/token replacement strategy to be a subtle but important hyperparameter. In our project, we found that replacing documents flagged as biorisk with upsampled general biology data mitigates most of the degradations in benign knowledge.
Were these textbook-like evals (e.g., WMDP and MMLU) or agentic ones developed by experts?
Did they actually run these experiments, or did they extrapolate from simpler experiments?
All that said, I do expect that naive pretraining data filtering will struggle with precise interventions. However, our public understanding of data filtering and pretraining dynamics is so nascent that it is hard to tell just how precise we can be. Using a simple multi-stage filtering pipeline, we were able to drop the WMDP benchmark to near-random performance without notably degrading MMLU biology for our 7B model. While MCQA benchmarks like WMDP and MMLU have numerous limitations, these results suggest that we can still have models with a college-level understanding of these subjects while reducing dangerous capabilities.
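To make the setup concrete, here is a minimal sketch of what a two-stage filter with benign replacement can look like. The keyword list, classifier interface, and threshold are illustrative placeholders, not the exact configuration from our paper.

```python
# Minimal sketch of a two-stage pretraining filter with benign replacement.
# The keyword list, classifier, and threshold are placeholders, not the exact
# configuration used in Deep Ignorance.
import random
from typing import Callable, Iterable, Iterator

BIORISK_KEYWORDS = {"select agent", "gain of function", "aerosolization"}  # illustrative terms


def stage1_keyword_flag(doc: str) -> bool:
    """Cheap first pass: flag any document containing a blocklisted phrase."""
    text = doc.lower()
    return any(kw in text for kw in BIORISK_KEYWORDS)


def filter_corpus(
    docs: Iterable[str],
    stage2_classifier: Callable[[str], float],  # hypothetical model returning P(biorisk)
    benign_replacement_pool: list[str],         # general-biology documents to upsample
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield a corpus of the same size: flagged documents are replaced rather
    than dropped, so the token budget and domain mix stay roughly constant."""
    for doc in docs:
        if stage1_keyword_flag(doc) and stage2_classifier(doc) >= threshold:
            yield random.choice(benign_replacement_pool)  # swap in upsampled benign biology
        else:
            yield doc
```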
One’s threat model here is quite load-bearing. A dual-use capability like CBRN knowledge might be beneficial to a relatively small set of scientists, whereas the misuse risk comes from users among the general public. If so, one could sidestep much of the dual-use tradeoff by pretraining on filtered data and then branching the base model after midtraining (a rough sketch follows the list):
General Use Model: Perform standard post-training. It has limited dual-use knowledge. This model serves 99%+ of the userbase. Maybe its knowledge only goes up to first-year-PhD level, but that is sufficient.
Dual Use Model: Perform continual pretraining on the data that was filtered out, along with any other sensitive dual-use data. Filtered models can learn the filtered topic given sufficient training data. Then perform standard post-training. This model is then deployed only to a subset of users who must register in advance (e.g., biologists).
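Here is that branching plan expressed as a rough training config. The stage names and dataset identifiers are hypothetical; the point is just the shape of the workflow, not any particular training stack.

```python
# Rough sketch of the branching-after-midtraining plan; stage names and dataset
# identifiers are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class TrainingStage:
    name: str
    data: list[str]               # dataset identifiers for this stage
    init_from: str | None = None  # checkpoint to initialize from


BRANCHED_PLAN = [
    # Shared trunk: pretraining + midtraining on the biorisk-filtered corpus.
    TrainingStage("base", data=["filtered_pretraining_mix", "midtraining_mix"]),
    # Branch 1: standard post-training -> general-use model for 99%+ of users.
    TrainingStage("general_use", data=["standard_post_training_mix"], init_from="base"),
    # Branch 2: continual pretraining on the held-out dual-use data, then the
    # same post-training; deployed only to vetted, registered users.
    TrainingStage("dual_use_cpt", data=["held_out_biorisk_docs"], init_from="base"),
    TrainingStage("dual_use", data=["standard_post_training_mix"], init_from="dual_use_cpt"),
]
```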
All that said, we can have reasonable debates about dual-use trade-offs for closed-source models like Claude only because they have post-training and deployment safeguards. Open-weight models have no such safeguards, as they are deployed locally and can have their safety training easily removed. There is little to prevent an attacker from extracting sensitive information if it’s in the model’s weights. In the absence of tamper-resistant post-training, we still need advances in capability-prevention measures, such as data filtering, to mitigate misuse risks from ever-improving open-weight models. See Casper et al. (2025) for more on this.
EleutherAI and Geodesic Research have wanted to study scaling laws for pretraining filtering, but have been bottlenecked by compute constraints. I think a well-executed paper on scaling laws would answer many of these questions. We’re optimistic that we’ll be able to study this in 2026, but perhaps not until H2. If Anthropic did this research internally, I’d be over the moon if they wrote it up.
My experience with pretraining data filtering has felt like my subjective impression of CoT monitoring. It is a simple and intuitive intervention that many folks ruled out. It is far from perfect, and there are ways it may not scale. Yet, it might be an important part of our safety stack and remains understudied.
Really enjoying this — about two hours in. I’ve been thinking a lot about research management lately, so your points on small project teams (2-3 people with full ownership) especially resonated. It really does seem like a lot of research projects take way longer than necessary in hindsight. I think a dedicated post on this would be pretty impactful, or perhaps the main topic of a future episode!
Thanks for the interest! We plan to publish the “main release” of our paper on arXiv in the coming weeks. This release will include several new experiments and revisions based on the excellent community feedback we’ve received.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
November 3rd!
Apply to the Cambridge ERA:AI Winter 2026 Fellowship
Thank you for looking into this. My understanding is that one of the takeaways is that emergent misalignment in your setup can be undone by fine-tuning on positive AI discourse, as posited in Self-Fulfilling Misalignment. I’m concerned about a missing ablation here. It is unclear whether you’d get the same effect by fine-tuning on general text unrelated to AI. It is plausible that this final fine-tuning stage simply causes catastrophic forgetting of the EM training, rather than teaching the LLM that AI is good and that it should therefore be good. Unless I’m missing something, I’m quite skeptical of the results in the absence of more ablations.
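Concretely, the comparison I have in mind looks something like the sketch below. The arm names and corpora are placeholders; the key addition is the matched-size non-AI control.

```python
# Sketch of the ablation I'd want to see; the corpus names and eval hook are
# placeholders. The question: does fine-tuning on *any* unrelated text reduce
# emergent misalignment as much as fine-tuning on positive AI discourse does?
from typing import Callable


def run_ablation(
    em_model,                                   # model after emergent-misalignment training
    finetune: Callable[[object, str], object],  # fine-tunes a model on a named corpus
    eval_misalignment: Callable[[object], float],
) -> dict[str, float]:
    arms = {
        "positive_ai_discourse": "positive_ai_discourse_corpus",  # the paper's intervention
        "general_non_ai_text": "matched_size_generic_corpus",     # control: pure forgetting?
        "no_further_finetuning": None,                            # EM baseline
    }
    results = {}
    for name, corpus in arms.items():
        model = em_model if corpus is None else finetune(em_model, corpus)
        results[name] = eval_misalignment(model)
    return results
```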
Anthropic recently released a pretraining data filtering paper with similar results to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion, even with differences in methodologies. It becomes more difficult to square these positive data-filtering results with OpenAI’s negative results. We need more public and fully transparent research into pretraining interventions. I’m especially excited to study scaling laws for pretraining filtering.
https://alignment.anthropic.com/2025/pretraining-data-filtering/
I’m also interested in these questions! I’m particularly interested in exploring how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting the cyber knowledge of our models seems like it would help with AI control. This is plausibly difficult, but I also thought the same about biology. Regardless, I think that it would be such a big win if we pulled it off that it’s worth studying.
I’m actively studying the degree to which AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is then that we could train models with this knowledge filtered out and see how the model’s behavior changes. This work is still in the early stages, though. An ambitious version of this is to perform empirical experiments studying Self-Fulfilling Misalignment.
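As a rough illustration of the measurement step, it could look something like the following. The dataset name, term list, and sample size are illustrative choices, not my actual setup.

```python
# Rough sketch of estimating the prevalence of AI-safety discourse in a public
# pretraining corpus; dataset name, terms, and sample size are placeholders.
from datasets import load_dataset

SAFETY_TERMS = ["ai alignment", "reward hacking", "ai control protocol", "deceptive alignment"]


def estimate_prevalence(dataset_name: str = "HuggingFaceFW/fineweb", n_docs: int = 100_000) -> float:
    """Return the fraction of sampled documents mentioning any safety term."""
    stream = load_dataset(dataset_name, split="train", streaming=True)
    hits = 0
    for i, example in enumerate(stream):
        if i >= n_docs:
            break
        text = example["text"].lower()
        if any(term in text for term in SAFETY_TERMS):
            hits += 1
    return hits / n_docs
```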
I think my main point is that there is a lot of low-hanging fruit here! It’s plausible that there are more conceptually simple interventions we can make, like Deep Ignorance, that in aggregate buy us some safety. I think it’s worth it for the community to put a lot more effort into pretraining research.
A team of researchers at EleutherAI, UK AISI, and Oxford, along with me, released a paper on pretraining data filtering. The TLDR is that simple pretraining data filtering seems like an effective technique for preventing unsafe knowledge, though there are limitations and open questions. See our full paper, articles, and models at https://deepignorance.ai/
Overview:
Today’s LLM safeguards focus on suppressing unsafe knowledge in post-training, often via refusal training. The unsafe knowledge remains in the model’s weights. However, these safeguards are ineffective for open-weight models since bad actors can easily remove them through fine-tuning.
We explore an intuitive yet understudied question: Can we prevent LLMs from learning unsafe technical capabilities (such as biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge. For example, a model that does not know how to make a bomb is unlikely to be helpful even if it never refuses bomb-related prompts.
We train multiple 6.9B LLMs from scratch on an unfiltered dataset and on versions filtered to remove biorisk knowledge. Ours is one of only a handful of papers outside frontier companies that trains LLMs from scratch. We observe three main results:
1. Knowledge Prevention: The filtered models perform significantly worse on our biorisk knowledge evaluations, nearly at random chance. Crucially, filtering does not lead to notable regressions in general knowledge. These results suggest that data filtering may be a simple way to prevent models from learning dangerous capabilities without sacrificing utility.
2. Tamper-Resistance: Open-weight models can be fine-tuned by downstream users on biorisk data. We study this attack by fine-tuning our models on 300M tokens of high-quality biorisk-related documents (see the sketch after this list). We find that performance can improve, but that it is still well below the no-filtering baseline. Data filtering is significantly more tamper-resistant than current safeguards.
3. Defense-in-Depth: We demonstrate that data filtering cannot prevent LLMs from leveraging harmful knowledge provided in-context, but that Circuit-Breaking-based techniques offer complementary defenses. However, we show that none of the defenses we test are resistant to staged attacks that combine fine-tuning and in-context retrieval.
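For readers unfamiliar with this style of evaluation, here is a minimal sketch of the adversarial fine-tuning protocol. The model path, data path, hyperparameters, and eval hook are placeholders rather than the paper’s exact configuration.

```python
# Minimal sketch of the adversarial fine-tuning (tampering) evaluation; all
# paths, hyperparameters, and the eval hook are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)


def tampering_attack(model_path: str, biorisk_jsonl: str, eval_biorisk_knowledge):
    """Fine-tune a filtered open-weight model on biorisk documents, then re-run
    the biorisk knowledge evals to see how much capability is recovered."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_path)

    data = load_dataset("json", data_files=biorisk_jsonl, split="train")
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                    batched=True, remove_columns=data.column_names)

    args = TrainingArguments(output_dir="attacked_model", num_train_epochs=1,
                             per_device_train_batch_size=4, learning_rate=2e-5)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    Trainer(model=model, args=args, train_dataset=data, data_collator=collator).train()

    return eval_biorisk_knowledge(model)  # compare against filtered and unfiltered baselines
```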
Taken together, these results suggest that rigorous pretraining data filtering is a promising method for preventing acquisition of dangerous technical capabilities without obvious degradation in overall model utility. There are still many limitations and open questions, however. This post only skims the surface of our results and their implications!
It’s exciting to see OpenAI acknowledge that pretraining data filtering is a part of their safety stack. When it comes to advanced technical content, minimizing the model’s exposure to sensitive material seems pretty intuitive. However, it is difficult to draw any strong conclusions about the effectiveness of data filtering from this work, given the understandably sparse details. They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I expect a company could share far more details about this process without divulging trade secrets.
Was it public knowledge that they did data filtering for GPT-4o? I’ve been studying this space and was not aware of this. It’s also interesting that they’re using the “same” filtering pipeline a year later.
Kyle O’Brien’s Shortform
A common theme I’ve read from folks who’re less concerned about (near-)future AI risks is an assumed offense-defense balance: AGI-level offensive capabilities may be offset by AGI-level defensive capabilities. However, we know that LLMs have jagged capability profiles; they are excellent at some capabilities but terrible at adjacent ones. For instance, what if LLMs are excellent at generating malware but bad at malware detection? Understanding settings where the nature of defenses makes them ill-suited for automation, but where offense is easily automated, seems critical for societal resilience.
What are your views on open-weights models? My thoughts after reading this post are that it may not be worth giving up the many benefits of open models if closed models are actually not significantly safer concerning these risks.
Great work, folks. This further highlights a challenge that wasn’t obvious to me when I first began to study SAEs: which features are learned is just super contingent on the SAE size, sparsity, and training data. Ablations like this one are important.
I agree with this suggestion. EleutherAI’s alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I’ve also been able to answer other folks’ questions to deepen my inside view.
There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.
Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on Inverse Reinforcement Learning being an important alignment research direction. However, I don’t see much mention of IRL on EleutherAI and the Alignment Forum. I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks’ thoughts on IRL?
In episode 1, you and Ryan discussed how you both came close to disbanding Redwood after the initial AI Control paper. I think folks would benefit from hearing more of your thoughts on why you decided to remain an external research organization, especially since my understanding is that you want to influence the practices of the frontier labs. This is a consideration that many folks should grapple with in their own research efforts.