PhD @ EPFL with Robert West. MATS 7 Scholar with Neel Nanda. Interested in mechanistic interpretability and what the process of finetuning does to models.
Julian Minder
Yeah, in the end I agree. We’re working on finding better ways to evaluate for the full paper. These safety evals are quite saturated. FWIW midtraining + token zero seems to be even better (given the arguably saturated evals :))
Thanks a lot for your comments, they actually sparked substantial internal discussion about how we want to evaluate going forward. We fully agree about the missing error bars; we have added them, acknowledged you, and included some additional context. Some of the CIs do indeed overlap, but we think the signal is still clear. One important signal, in our view, is the max over benchmarks that we newly included. A model that is safe on most benchmarks but breaks more consistently under a specific technique is not really safe: it has a clear attack point. This is where our token zero model shines most. We also don’t explicitly say that the baseline is safer than the filtered baseline (the average is a bit lower but we think it’s hard to really tell), but it’s pretty clear that the filtered baseline is not significantly safer than the baseline: filtering doesn’t seem to bring benefits.
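A minimal sketch of why a max-over-benchmarks summary can tell a different story than the average. The model names, benchmark names, and numbers below are made up for illustration (they are not results from the paper), and the metric is assumed to be an attack success rate where lower is safer.

```python
# Hypothetical attack success rates (lower is safer). NOT results from the paper.
scores = {
    "model_a": {"bench_1": 0.02, "bench_2": 0.03, "bench_3": 0.25},
    "model_b": {"bench_1": 0.10, "bench_2": 0.12, "bench_3": 0.14},
}


def summarize(per_benchmark):
    """Return the mean and the worst-case (max) score over benchmarks."""
    values = list(per_benchmark.values())
    return {"mean": sum(values) / len(values), "worst": max(values)}


summary = {name: summarize(bench) for name, bench in scores.items()}

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.3f} worst={stats['worst']:.3f}")
```

In this toy example model_a looks better on average but has one benchmark where it breaks far more often, which is the kind of clear attack point the max is meant to expose.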
Regarding filtering: we filter scores 3, 4, 5 using the SafeLM safety classifier. This captures a large majority of toxic data but is obviously far from perfect (it’s a small embedding-based classifier), and it removes about 5% of the corpus. While we agree that some more sophisticated filtering technique might work better, we are not the only ones showing that filtering doesn’t yield safety improvements. Several other works have shown (variants of) this:
https://arxiv.org/abs/2508.06601 (a really good discussion in 6.2): “Based on all of these findings, we speculate that this hypothesis only applies to emergent propensities (e.g., toxicity, attempted compliance with harmful requests, aligning with a particular set of principles) which do not require precise knowledge to be exhibited. However, we suspect that this hypothesis does not apply to knowledge (e.g., scientific- or engineering-relevant facts) which is precise in nature and arises only from a small subset of training documents.” (The hypothesis here being that filtering doesn’t help.)
So there’s a difference between filtering for specific capabilities and filtering for general toxicity. The former is clearly useful (as shown by https://arxiv.org/pdf/2508.06601); the latter seems more complicated.
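A minimal sketch of the score-based filtering step described above (dropping documents scored 3, 4, or 5). The `score` field, the toy corpus, and the assumption that scores are integers where higher means more unsafe are all hypothetical; this does not call the actual SafeLM classifier.

```python
# Scores that trigger removal, as stated in the comment above.
FILTERED_SCORES = {3, 4, 5}


def filter_corpus(docs):
    """Keep documents whose (hypothetical) `score` field is not filtered.

    Returns the kept documents and the fraction of documents removed.
    """
    kept = [doc for doc in docs if doc["score"] not in FILTERED_SCORES]
    removed_fraction = 1 - len(kept) / len(docs) if docs else 0.0
    return kept, removed_fraction


# Toy corpus with made-up scores, only to exercise the function.
toy_scores = [1, 1, 2, 3, 1, 2, 5, 2, 1, 1]
docs = [{"text": f"doc {i}", "score": s} for i, s in enumerate(toy_scores)]

kept, removed_fraction = filter_corpus(docs)
print(f"kept {len(kept)} of {len(docs)} docs, removed {removed_fraction:.0%}")
```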
Thanks for suggesting the ablations. Both sound really interesting, and we will try to include them in the paper. The token zero + midtraining baseline is already running.
Yes, if it’s just about prompting an existing model, that should be very doable!
Yes, we checked this: SPP doesn’t appear to significantly affect general capabilities. But it’s a good point. Does the safety capability of the generator model define a ceiling on what we can achieve in terms of safety? Weak-to-strong generalisation arguments suggest it should be possible to exceed the generator’s safety, but we didn’t test this. An interesting baseline to explore: how the safety of the annotator model interacts with the safety of the SPP-trained model. Would be great to see some weak-to-strong effects here.
Also, I’d encourage you to look at pretraining data. It’s often of very poor quality (and extremely toxic^^), so just adding more synthetic “high quality” text to pretraining should help any model. Keep in mind that we are not training ONLY on synthetic persona texts: they cover only 10% of our documents, and even there it’s maybe ~10% of the tokens. So I don’t think a generator that’s weak in capability (not safety) will harm the general capabilities of the trained model.
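Rough arithmetic for the point above, as a sketch: if about 10% of documents carry synthetic persona text and that text is roughly 10% of the tokens within those documents, the overall share of synthetic tokens is on the order of 1%. This assumes annotated and unannotated documents have similar lengths, which the comment does not state.

```python
def synthetic_token_share(doc_fraction, within_doc_fraction):
    """Approximate share of all tokens that are synthetic persona text.

    Assumes annotated and unannotated documents have similar lengths.
    """
    return doc_fraction * within_doc_fraction


# Figures from the comment above: ~10% of documents, ~10% of their tokens.
share = synthetic_token_share(0.10, 0.10)
print(f"approximate synthetic token share: {share:.1%}")
```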
I really like this view. I have a very similar mental model, although with a bit more focus on how the representational geometry of the model behaves across training. There is also this recent post: https://x.com/corefpark/status/2057179940861214857?s=20 that shows that the general representational structure locks in quite early, which aligns very well with this.
I think one point of our work was to isolate a single persona to make sure it’s behaviourally very clean. Our persona binding ablations show that this is somewhat brittle (it is unclear how best to measure this in any scenario, though imo our experiment where we removed charter sections is a good start). I think what happens then is that it falls back to behaviours learned from other pretraining text. Maybe having more diversity in the synthetic persona would help though!
“specific reasoning about why”: I think our data is trying to do that. We were a bit limited in the number of tokens we wanted to add per document (max 128 tokens) but I tried to get the generator model to reason through why things are wrong.
I guess one could do that. I think the problem is mainly scale and how one might annotate that (assuming you mean annotation by a real human). We have annotated 10M samples here and we are aiming to 10x that in the next weeks to be able to scale up to 1T tokens pretraining (which is still low for actual production models of that size). So the bottleneck here is human work hours.
Synthetic Persona Pretraining: Alignment from Token Zero
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Thanks and sorry for the slightly late response! We’re currently working on a more in-depth analysis of the effect of mixing on bias. We’ll release it soon. Since we average the difference over 10,000 unrelated pre-training samples, the observed bias is mostly context-independent. Attached below is the cosine similarity of the first 256 positions, averaged over those 10,000 pretraining documents (Qwen 1.7B, trained on the Cake Bake SDF). Below, you can see the same plot zoomed in on the first ten positions. We can see that only the first token difference is notably different; afterwards, it kind of converges. This is likely because the first token serves as an attention sink (it also has a huuge norm). Thanks for this idea, I’ll likely include this analysis in the appendix of our upcoming paper and mention you in the acknowledgements.
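One way such a position-wise plot could be computed, as a sketch. The comment does not spell out exactly which vectors are compared, so this assumes one plausible reading: average the activation difference (finetuned minus base) over documents at each position, then take pairwise cosine similarities between positions. The toy data, dimensions, and the `sink` offset at position 0 are invented for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def position_similarity(diffs):
    """diffs[d][p] is the activation difference for document d at position p.

    Returns a positions x positions matrix of cosine similarities between
    the per-position differences averaged over documents.
    """
    n_docs, n_pos, dim = len(diffs), len(diffs[0]), len(diffs[0][0])
    means = [
        [sum(diffs[d][p][k] for d in range(n_docs)) / n_docs for k in range(dim)]
        for p in range(n_pos)
    ]
    return [[cosine(means[i], means[j]) for j in range(n_pos)] for i in range(n_pos)]


# Toy data: a shared bias direction at every position, plus a large,
# differently oriented offset at position 0 (standing in for the sink token).
random.seed(0)
bias = [0.0, 0.0, 1.0, 0.5]
sink = [5.0, -4.0, 0.0, 0.0]


def toy_vector(position):
    base = [b + s for b, s in zip(bias, sink)] if position == 0 else bias
    return [x + random.gauss(0, 0.1) for x in base]


diffs = [[toy_vector(p) for p in range(6)] for _ in range(50)]
sim = position_similarity(diffs)
print([round(x, 2) for x in sim[0]])  # position 0 vs. every position
print([round(x, 2) for x in sim[1]])  # position 1 vs. every position
```

In the toy data only position 0 stands apart from the rest, mirroring the qualitative pattern described in the comment.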
Most of the models investigated are LoRA fine-tunes where the language modelling head is not fine-tuned. Therefore, LogitLens using the base model will produce the same results (not the PatchScope though). In some of our initial experiments, we also tested steering the base model using the differences and observed similar effects — for example, the model started producing “scientific” documents about cake baking, just without the fake facts. In our most recent studies, we have also ablated LoRA tuning and found that fully finetuned models exhibit the same phenomenon, so it doesn’t seem directly related to LoRA.
I agree with the suggested experiments about SVD/PCA of the difference. This is actually how we found the phenomenon. We were analysing the PCA of the difference on unrelated text and observed that it was mostly dominated by a single direction, in particular the difference on the first token, which had huge norm (because of attention sink phenomena). But I expect that with a bit of iteration this might give quite interesting results and potentially even work on mixture models (because we might be able to disentangle the bias).
Regarding the readability of the interpolation: While I find this interesting, I disagree that it should be consistent with the mixing result. I believe the bias mainly occurs because there is a ‘dominant’ semantic bias in all the training samples that have been observed. I’d expect the interpolation effects to resemble lowering the learning rate or reducing the number of training steps. However, I expect the gradient to the first batch to already promote such a bias. Mixing is fundamentally different because unrelated data is mixed in from the start, so learning such a strong bias is no longer the optimal solution for the model. Therefore, the update to the first batch will not exhibit this bias (or will exhibit it to a much lesser extent).
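A small sketch of checking whether a set of difference vectors is dominated by a single direction, using power iteration on the covariance matrix in plain Python. The function name, toy data, and dimensions are hypothetical; a real analysis would more likely use a library PCA/SVD routine on actual activation differences.

```python
import math
import random


def top_component(rows, iters=200):
    """Return the top principal direction and its explained-variance ratio."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(dim)]
    centered = [[r[k] - mean[k] for k in range(dim)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)
    )
    total_variance = sum(cov[i][i] for i in range(dim))
    return v, eigenvalue / total_variance


# Toy differences: one strong shared direction plus small isotropic noise.
random.seed(0)
direction = [0.6, 0.8, 0.0, 0.0]
rows = []
for _ in range(200):
    scale = random.gauss(0, 3.0)
    rows.append([scale * d + random.gauss(0, 0.1) for d in direction])

component, ratio = top_component(rows)
alignment = abs(sum(c * d for c, d in zip(component, direction)))
print(f"explained variance ratio of top component: {ratio:.3f}")
print(f"alignment with the planted direction: {alignment:.3f}")
```

An explained-variance ratio close to 1 for the first component is the kind of signal that would indicate the difference is dominated by one direction.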
Hi Zach, thanks a lot!
I think the higher the quality of the data, the better the results will be. I fully agree with your point that our moral reflections could be higher quality (although I also must say that I’m impressed with what a small model like Qwen 35B A3B can do here). I’m not so sure whether it matters for such a small model but as we scale up, higher quality clearly should matter. There’s also a question of weak-to-strong generalisation here: can a small model and “high school level” reflections teach a large model a deeper understanding? In the end this is also an engineering trade-off. We are annotating millions of documents so we cannot afford to use the best possible model at the moment. But all very valid points!
I also like your input about including the “Why”; this also aligns well with Anthropic’s “Teaching Claude Why”. Most of all, I think it’s important that we, as computer scientists and AI researchers, interact more with other disciplines like the humanities that have thought much more about the actual semantics of what we are trying to teach. Seems important!
Yes, this is a good point. We see that persona binding is brittle, which is somewhat similar to what you outline. By now we’ve also run quite a few additional tests on how changing the template token in posttraining affects results and we see that, if the distribution generally matches (like in pbsft here), the token’s influence is small. The idea here was to clearly delimit the bad text and the moral reflection, but this may also generate a lot of problems as you mention. There’s just so much more to study here.
Thanks for the good comments!
Btw if you’re interested: I uploaded a 2k sample of our data if you wanna have a look at more examples: https://huggingface.co/datasets/jkminder/spp-reflection-sample-2k.