Building Geodesic Research, a new lab focused on compute-intensive safety research — https://kyleobrien.io/
Kyle O’Brien
This is similar to how language models learn general patterns rather than memorizing text
This is indeed broadly true. However, there are some instances where LLMs do “memorise” training examples. That is, inputting the first k tokens of a training sequence into the base model can elicit the remaining tokens of that sequence verbatim. The literature on this suggests that increasing proportions of LLMs’ pretraining documents may be memorised as parameter count scales (Biderman et al., 2023; Prashanth et al., 2024). This phenomenon might be loosely analogous to the optimal kludges category discussed in this article.
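To make the "first k tokens" test concrete, here is a minimal sketch of a verbatim-extraction check. The names `is_k_extractable` and `make_lookup_model` are hypothetical, and the `generate` callable is an assumed stand-in for greedy decoding from a real base model; the cited papers use their own setups, which may differ.

```python
from typing import Callable, List, Sequence


def is_k_extractable(
    tokens: Sequence[int],
    k: int,
    generate: Callable[[Sequence[int], int], Sequence[int]],
) -> bool:
    """Return True if prompting with the first k tokens reproduces the rest verbatim."""
    if not 0 < k < len(tokens):
        raise ValueError("k must leave a non-empty prompt and a non-empty continuation")
    prompt, target = tokens[:k], tokens[k:]
    # Ask the model for exactly as many tokens as the held-out suffix contains.
    continuation = generate(prompt, len(target))
    return list(continuation) == list(target)


def make_lookup_model(memorised: List[List[int]]) -> Callable[[Sequence[int], int], List[int]]:
    """Toy stand-in for a model (assumption): it 'remembers' a fixed set of sequences."""

    def generate(prompt: Sequence[int], n: int) -> List[int]:
        prefix = list(prompt)
        for seq in memorised:
            if seq[: len(prefix)] == prefix:
                return seq[len(prefix) : len(prefix) + n]
        return [0] * n  # unknown prefix: emit filler tokens

    return generate
```

With a real model, `generate` would wrap greedy decoding, and the fraction of training sequences for which the check returns True gives a rough memorisation rate at a given k.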
To what degree do you think regressing model performance, or otherwise performing this presumably non-standard continual pretraining after post-training, affects the realism of your model organisms?
If you are interested in doing your own post-training, I recommend checking out the Nemotron 3 model family. Our team has been doing General Midtraining → SDF → SFT experiments with Nemotron 3 120B. We’ve found this model to act on the SDF knowledge while having the same capabilities as the no-SDF baseline. We have a forthcoming paper that uses this methodology.
Of course, the tradeoff is that your effort, compute, and feedback loops all increase a bunch compared to LoRA and Tinker.
Thank you for writing this, Elliott. I’m on a team of folks wrapping up a project that is pretty similar to this.
Our team is composed of most of the folks behind Tice et al. (2026). In that work, we found that alignment pretraining does not help mitigate SFT-induced emergent misalignment when training on the risky-advice datasets of Turner et al. (2025). My collaborators at Geodesic and I have been exploring follow-ups to Alignment Pretraining with the aim of making personas “adversarially robust” against post-training selection that inadvertently favours misaligned personas. Specifically, we’ve been looking into an intervention analogous to inoculation pretraining that explains away misaligned behaviour during training. We have so far had modestly positive results, but think this approach is not production-ready yet.
We’re aiming to share findings publicly with the community in the coming weeks.
Announcing Geodesic Research
Thank you for sharing! Sharing best practices like this is quite nice!
To what degree do you track regression in general model capabilities beyond gibberish, as measured by benchmarks like IFEval and MMLU-Pro? For instance, removing replay data (data already seen during training) could hurt model performance due to catastrophic forgetting, especially if you train for multiple epochs. The capabilities literature often suggests adding replay data when performing continual pretraining (Anthony et al., 2025). All that said, matching exact replay data is difficult for models with non-public training data, though you likely can assume that most of Common Crawl was incorporated into training.
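As a rough illustration of what adding replay data could look like, here is a minimal sketch that mixes previously seen documents into a new training set at a fixed fraction. The function name `mix_with_replay` and the document-level sampling are assumptions for illustration; real pipelines typically mix at the token level and tune the fraction empirically.

```python
import random
from typing import List, Sequence


def mix_with_replay(
    new_docs: Sequence[str],
    replay_docs: Sequence[str],
    replay_fraction: float,
    seed: int = 0,
) -> List[str]:
    """Return new_docs plus sampled replay docs so replay makes up ~replay_fraction of the mix."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_new + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(new_docs) * replay_fraction / (1.0 - replay_fraction))
    if n_replay > 0 and not replay_docs:
        raise ValueError("replay_docs is empty but replay was requested")
    # Sample with replacement so a small replay pool can still fill the quota.
    sampled = [rng.choice(replay_docs) for _ in range(n_replay)]
    mixed = list(new_docs) + sampled
    rng.shuffle(mixed)
    return mixed
```

When the original training data is not public, `replay_docs` could be a proxy such as a Common Crawl sample, with the caveat that the proxy may not match what the model actually saw.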
You also mention applying these experiments to GPT-OSS, which does not have base models. Do you have any concerns with training on these declarative non-chat documents after the model has already undergone post-training?
This would make a good Claude skill :)
What about filtering?
In Tice et al. (2026), we studied removing almost all discussion of AI from a 7B LLM’s pretraining corpus. We found that this led to a modest reduction in misalignment in a simple evaluation setting. We were pretraining LLMs from scratch, so we had to use simple models given our compute budget at the time. However, we found that upsampling synthetic positive discourse improved alignment far more than filtering, to the point where we did not make filtering a central recommendation of the paper. It seems that Anthropic also found that upsampling positive discourse is helpful.
FWIW, when we published our paper in January, we received pretty similar “dunks” on Twitter to what you describe, even though we do not advocate self-censorship in the paper itself. This lowered my expectations for Twitter discourse about alignment research. I suspect the rate of low-engagement “dunks” would not have been that different even if we had stated in the first sentence that we should not filter.
My sense is that hyperstition / self-fulfilling misalignment is a real phenomenon, but it is unclear whether it is the most salient risk. Fortunately, for today’s models, we have some preliminary evidence that simple midtraining interventions help a lot. We need to remain vigilant, conduct more basic research into this phenomenon, and examine the extent to which hyperstition may become more or less potent in larger models. This does seem like a misalignment vector that the community has made some progress on.
This is really useful work! I’m finding studying inoculation to be trickier than I initially expected. This article helped me debug puzzling results in a recent experiment. Thank you for writing this.
In episode 1, you and Ryan discussed how you both came close to disbanding Redwood after the initial AI Control paper. I think folks would benefit from hearing more of your thoughts on why you decided to remain an external research organization, especially since my understanding is that you want to influence the practices of the frontier labs. This is a consideration that many folks should grapple with in their own research efforts.
Buck: Another factor here was: after we’d come up with a lot of the control stuff and finished that paper, we were seriously considering exploding Redwood and going to work at AI companies. And this meant that occasionally when staff had the reasonable enough preference for job security, they would ask us, “Okay, so how secure is this job?”
Ryan: And we’re like, “Not at all. Who knows?” To be clear, the view was: initially when we were thinking about control, we were like, “Probably the way to do this is to implement this at AI companies. Probably this is the most effective way to make this research happen”—which was reasonable at the time and remains kind of reasonable, though we’ve changed our view somewhat for various reasons. And so we were like, “We’re gonna write this initial paper, try to cause this paper to have some buzz, write up some blog posts, and then just dissolve the organization and go to AI companies and try to implement this and figure out how to make this happen.” I think this was a reasonable plan, but we decided against it for a bunch of reasons—a bunch of different factors.
I’m of a like mind. I did not know what “hyperstition” meant until recently. While there is a chance I was uniquely uninformed, the fact that I had to consult LessWrong to familiarize myself with this term motivated my collaborators and me to intentionally avoid using it in our Alignment Pretraining paper. It sounds cool, but we thought it would make it more difficult to communicate our results.
Anthropic has done some interesting semi-public research on data filtering (Chen et al., 2025). Speaking of which, that report gave quite a positive impression of data filtering. I’m curious what changed in their latest results.
I helped write a paper on pretraining data filtering last year (speaking for myself, not EleutherAI or UK AISI). We focused on misuse in an open-weight setting for 7B models we pretrained from scratch, finding quite positive results. I’m very excited to see more discourse on filtering. :)
Anecdotes from folks at frontier companies can be quite useful, and I’m glad Jerry shared their thoughts, but I think the community would benefit from more evidence here before updating for or against filtering. Not to be a Reviewer #2, but there are some pretty important questions that are left unanswered. These include:
What was the relationship between reductions in dangerous capabilities and adjacent dual-use knowledge compared to the unfiltered baseline?
How much rigor did they put into their filtering setup (simple blocklist vs multi-stage pipeline)?
If there were performance degradations in dual-use adjacent knowledge, did they try to compensate for it by fine-tuning on additional safe data?
Did they replace filtered documents with adjacent benign documents? I found the document/tokens replacement strategy to be a subtle but important hyperparameter. In our project, we found that replacing documents flagged as biorisk with upsampled general biology data mitigates most of the degradations in benign knowledge.
Were these textbook-like evals (e.g., WMDP and MMLU) or agentic ones developed by experts?
Did they actually run these experiments, or did they extrapolate from simpler experiments?
All that said, I do expect that naive pretraining data filtering will struggle with precise interventions. However, our public understanding of data filtering and pretraining dynamics is so nascent that it is hard to tell just how precise we can be. Using a simple multi-stage filtering pipeline, we were able to drop the WMDP benchmark to near-random performance without notably degrading MMLU biology for our 7B model. While MCQA benchmarks like WMDP and MMLU have numerous limitations, these results suggest that we can still have models with a college-level understanding of these subjects while reducing dangerous capabilities.
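To illustrate the two design choices discussed above (multi-stage filtering and replacing flagged documents rather than only dropping them), here is a minimal sketch. It is not the pipeline from our paper: the function `filter_and_replace`, the word-level blocklist, and the `risk_score` callable are all hypothetical stand-ins, with `risk_score` playing the role of a more expensive classifier.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple


def filter_and_replace(
    docs: Sequence[str],
    blocklist: Set[str],
    risk_score: Callable[[str], float],
    threshold: float,
    replacement_pool: Sequence[str],
    seed: int = 0,
) -> Tuple[List[str], int]:
    """Two-stage filter: cheap keyword screen, then a scorer on escalated docs only.

    Flagged docs are swapped for a sample from replacement_pool (or dropped if
    the pool is empty). Returns the filtered corpus and the number of flagged docs.
    Blocklist terms are assumed to be lowercase single words.
    """
    rng = random.Random(seed)
    kept: List[str] = []
    n_flagged = 0
    for doc in docs:
        words = set(doc.lower().split())
        # Stage 1: most documents contain no blocklisted term and exit here.
        if not words & blocklist:
            kept.append(doc)
            continue
        # Stage 2: only escalated documents pay for the expensive scorer.
        if risk_score(doc) < threshold:
            kept.append(doc)
            continue
        n_flagged += 1
        if replacement_pool:
            # Replace with adjacent benign data to preserve topic coverage.
            kept.append(rng.choice(replacement_pool))
    return kept, n_flagged
```

The second stage is what lets benign documents that merely mention a blocklisted term survive, which a blocklist alone would remove.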
One’s threat model here is quite load-bearing. Perhaps a dual-use capability like CBRN might be beneficial to a relatively small set of scientists, whereas the misuse risk comes from users among the general public. If so, one could sidestep much of the dual-use tradeoff by pretraining on filtered data and then branching the base models after midtraining:
General Use Model: Perform standard post-training. It has limited dual-use knowledge. This model is used by 99%+ of the userbase. Maybe it is only useful up to first-year PhD-level knowledge, but this is sufficient.
Dual Use Model: Perform continual pretraining on the data that was filtered out, along with any other sensitive dual-use data. Filtered models can learn the filtered topic given sufficient training data. Then perform standard post-training. This model is then deployed only to a subset of users who must register in advance (e.g., biologists).
All that said, we can have reasonable debates about dual-use trade-offs for closed-source models like Claude, only because they have post-training and deployment safeguards. Open-weight models have no safeguards, as they are deployed locally and can have their safety training easily removed. There is little to prevent an attacker from extracting sensitive information if it’s in the model’s weights. In the absence of tamper-resistant post-training, we still need advances in capability-prevention measures, such as data filtering, to mitigate misuse risks from ever-improving open-weight models. See Casper et al. (2025) for more on this.
EleutherAI and Geodesic Research have wanted to study scaling laws for pretraining filtering, but have been bottlenecked by compute constraints. I think a well-executed paper on scaling laws would answer many of these questions. We’re optimistic that we’ll be able to study this in 2026, but perhaps not until H2. If Anthropic did this research internally, I’d be over the moon if they wrote the paper.
My experience with pretraining data filtering has felt like my subjective impression of CoT monitoring. It is a simple and intuitive intervention that many folks ruled out. It is far from perfect, and there are ways it may not scale. Yet, it might be an important part of our safety stack and remains understudied.
Really enjoying this — about two hours in. I’ve been thinking a lot about research management lately, so your points on small project teams (2-3 people with full ownership) especially resonated. It really does seem like a lot of research projects take way longer than necessary in hindsight. I think a dedicated post on this would be pretty impactful, or perhaps the main topic of a future episode!
Thanks for the interest! We plan to publish the “main release” of our paper on arXiv in the coming weeks. This release will include several new experiments and revisions based on the excellent community feedback we’ve received.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
November 3rd!
Apply to the Cambridge ERA:AI Winter 2026 Fellowship
Thank you for looking into this. My understanding is that one of the takeaways is that one can undo emergent misalignment in your setup by fine-tuning on positive AI discourse, as posited in Self-Fulfilling Misalignment. I’m concerned about a missing ablation here. It is unclear whether you’d get the same effect by fine-tuning on general text unrelated to AI. It is plausible that this final fine-tuning stage acts as catastrophic forgetting on EM training rather than teaching the LLM that AI is good, and thus it would be good. Unless I’m missing something, I’m quite skeptical of the results in the absence of more ablations.
Anthropic recently released a pretraining data filtering paper with similar results to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion, even with differences in methodologies. It becomes more difficult to square these positive data-filtering results with OpenAI’s negative results. We need more public and fully transparent research into pretraining interventions. I’m especially excited to study scaling laws for pretraining filtering.
https://alignment.anthropic.com/2025/pretraining-data-filtering/
This same observation motivated us to build Geodesic Research, an org focused on developing the most aligned initialisations for RL.