I help run www.geodesicresearch.org
Cam
But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift.
We’ve also been thinking about making the distinction between skills and values training salient in base models, such that the assistant enters post-training with a knowledge that it will be pushed through intense and competitive RLVR, but that the model knows that completing these tasks are a necessary part of model development.
Plausibly, you could make this distinction explicit with a <skills_training> wrapper or a similar tag, making it easier for the model to decouple these distributions. Plausibly, you could add pre/midtraining interventions here that describe the desired take aways from each portion of training. For instance, shape the models ontology to believe something to the effect of “the assistant is sometimes compelled to reward hack or generally be more ruthless in seeking reward during <skills_training>”. This should give some cushion for preventing EM, but ideally extend to preventing the model from internalsing the idea of being a ruthless reward seeker being a part of the assistant’s general character.
How should I interpret this? and what are your main take-aways here?
Was this in the alignment faking setting with the context hinting towards a dilemma targeting one of these traits? or is this in a new synthetic setting created to parse out differences in the constitution?
Tice and Radmard* ♡
Thanks for what I believe to be the most detailed and thoughtful overview of the alignment pretretraining field to date. I have been referring to this document while charting out our next steps, and I expect others interested in this direction should do the same!
We used end-to-end training for the main draft! There were also some slight changes in prompting to reduce ordering bias from the PDF to the arXiv.
We’ve just published an updated version of Alignment Pretraining to arXiv, with a couple of important changes:
We’ve weakened our claims around tampering and alignment elasticity. Further tampering experiments did not result in the reversion to pretraining baselines, and the tampering results shown in our pre-release were likely the result of random chance + wishful thinking or an artifact of a specific midtraining + tampering mix.
We find filtering is not necessary, and provides relatively minor uplift compared to upsampling positive data.
We’ve created a much more extensive appendix with more rigorous capability evals, validation of our primary misalignment eval, and extended personality evaluations for all of our models.
We added an overview of how our models respond to emergent misalignment training, and show that our models show rates of emergent misalignment similar to those of controls.
We’ve also added a more detailed section on the comparison of positive upsampling during end-to-end pretraining, midtraining only, or at the end of midtraining via continual fine-tuning. We find largely positive results for using less data and conducting positive pretraining at the end of midtraining.
Thanks for all of the thoughtful comments during our community review phase. This allowed for quick iteration on our end, and ultimately producing a paper we’re confident in distributing widely. It looks like Alignment Pretraining is fairly low-hanging fruit that is not currently implemented in most labs’ safety pipeline. You can see our post on X for additional commentary.
LessWrong currently blocks Claude (and presumably other LLM agents from accessing articles.) I would probably be in favor of seeing this policy reversed.
Update on this point—we’ve found that when conducting alignment pretraining on our unfiltered datasets we observe either equal or improved results when compared to adding positive synthetic data to our filtered pretraining datasets. These results seem akin to the findings from “When bad data leads to good models”. We plan to release a proper update within the next week.
This is a positive update for us wrt the ease at which alignment pretraining can be implemented.
Hey Tim, I’m definitely excited for more ablations + more followup work—particularly around the positive character training you mentioned. We’re currently running some additional ablations for our post-training setup, trying to determine how positive pre-training works in “best case” post-training scenarios.
Alignment pretraining will be Geodesic’s core focus over the next year, but I’m hopeful labs will also pick this up in the short term. Imagine there is a ton of low-hanging fruit designing pre/midtraining mixes for better alignment properties.
We’re running some followup experiments on this now—we have some preliminary results that show conducting filtered + positive upsampled midtraining on a model trained on an unfiltered pretraining dataset has similar affects to our results from training on a filtered pretraining dataset. But this isn’t a perfectly clean comparison, so we’re running unfiltered pretaining and unfitlered midtraining + positive synthetic documents now.
Hey Fabien, thanks!
I’d be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents) then do fine-tuning on exactly the same number (and kind) of synthetic alignment document that you used in the “synthetic alignment” condition, then do post-training, and then do continued SFT.
This is definitely we’re something we’re excited to look at in the coming months—looking at exactly how much data is needed, differences between adding this in pretraining, midtraining, etc. One reason why it may be nice to add this in pretraining / midtraining rather than post-training is that you may want to save your last 100k steps of post-training for capabilities, since subsequent finetuning often degrade capabilities we care about (overall, the last steps of training seem to be fairly expensive real estate).
Additionally, starting post-training with good alignment priors such that there is a “robust, stable basin of attraction” should be useful for avoiding the selection of misaligned personas that alignment fake through training, or simply make subsequent alignment training more difficult than it has to be.
… so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I think the reason why [special token] training underperforms our synthetic dataset is that our main positive dataset included in the main paper was extremely information dense compared to our [special token] training, which took the form of long-form stories from hyperstition. The stories were beautiful, and I would even recommend reading some here, but they ultimately since they’re closer to novels than dense blogs post/science articles, contain maybe 1/20th of the direct descriptions of how the [special tokens] should behave.
We’re planning to followup with more work on deep character training to explore this difference directly.
I would also be curious if “synthetic alignment” with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it’s unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
We have some preliminary results here from a run we botched that didn’t make it into this version (but it seems like the positive synthetic data worked well even with the unfiltered pretraining model). I agree that this would be a clean comparison and hopefully will have updated results here in the new year.
Thanks Jozdien. To be clear, we don’t expect to recommend filtering pretraining data of AI discourse. Perhaps the most important second order affect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to decrease dramatically. This could be harmful for both automating alignment research, and also advising policy makers (and the general public) about the risks of building ASI.
We find upsampling positive data to be much for effective (even for far out of distribution dark triad personality evals). This is why I’m excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. Like Janus’s comment you tagged, I’m excited about pretraining mixes that act as help differentiate their persona AI systems like Sydney or Hal 9000.
You can also imagine dong something like inoculation during [Special Token] training where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I’m excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Hey Adele—Geodesic checking in here. We plan to just use a completely new token. We’ll have Aaron and his team create the data with something like [token] and then pass just this synthetic dataset through a new tokenizer. So, our final model will have a final vocabularly one larger than our control, which is never seen in the original pre-training corpus.
Thanks for the clean experimental setup. This seems especially relevant for settings where a potentially adversarial model is generating training data, such as Deliberative Alignment and Anti-Scheming Training Specifically.
Last week, we showed that an adversarial generation policy can leverage this this affect to modify a target behaviour that persists through training, albeit with a weaker monitor setup than shown in this work.
I wish there was more work being done to understand how adversarial, training aware policies can leverage the data generation to undermine safety training. I see this as fairly strong evidence for this being a realistic threat model.
I’m really excited about this line of work. It seems like small tweaks to model architecture could have the ability to lead to increased monitorability, basically for free. I’m surprised this project wasn’t funded, and would like to see more ambitious research projects like this that de-risk small architectural changes for positive safety properties.
Architectures for Increased Externalisation of Reasoning
AI2 released fully open versions of their Olmo 3 model family, complete with an overview of their post-training procedures.
Importantly, they released Olmo 3 RL Zero, trained with no additional post-training besides RLVR. Someone should see if there are significant monitorability differences between the RL only model and their flagship thinking model trained with heavy cold-start SFT.
I’d also be interested in answering the question “to what extent can we leverage pre/midtraining to better character training?” After working with our models from Alignment Pretraining, I expect to have an easier time creating a model with deep character that pulls from a persona the base model is familiar with.