Anthropic recently released a pretraining data filtering paper with results similar to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion despite differences in methodology. It also becomes more difficult to square these positive data filtering results with OpenAI’s negative results. We need more public and fully transparent research into pretraining interventions. I’m especially excited to study scaling laws for pretraining filtering.
https://alignment.anthropic.com/2025/pretraining-data-filtering/
Thank you for looking into this. My understanding is that one of the takeaways is that, in your setup, emergent misalignment can be undone by fine-tuning on positive AI discourse, as posited in Self-Fulfilling Misalignment. I’m concerned about a missing ablation here: it is unclear whether you’d get the same effect by fine-tuning on general text unrelated to AI. It is plausible that this final fine-tuning stage simply induces catastrophic forgetting of the EM training, rather than teaching the LLM that AI is good and therefore that it itself should be good. Unless I’m missing something, I’m quite skeptical of the results in the absence of more ablations.
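To make the ablation concrete, here is a minimal sketch of the control run I have in mind, assuming a standard Hugging Face fine-tuning setup. The checkpoint path, control corpus, and hyperparameters below are placeholders, not anything from your actual experiments:

```python
# Hypothetical control run: fine-tune the EM-trained checkpoint on general,
# AI-unrelated text with the same training budget as the positive-AI-discourse run.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "path/to/em-finetuned-model"  # placeholder for the EM-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# General-domain corpus with essentially no AI discourse (placeholder choice).
control_corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:1%]")
control_corpus = control_corpus.filter(lambda x: len(x["text"].strip()) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

control_corpus = control_corpus.map(
    tokenize, batched=True, remove_columns=control_corpus.column_names
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Match epochs, learning rate, and token count to the positive-discourse run,
# so the only difference between the two conditions is the content of the data.
args = TrainingArguments(
    output_dir="em-control-finetune",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=control_corpus,
    data_collator=collator,
).train()

# Afterwards, re-run the emergent-misalignment evals on this checkpoint and on
# the positive-AI-discourse checkpoint. If both recover to a similar degree,
# catastrophic forgetting, not the "AI is good" content, explains the effect.
```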