Hey Fabien, thanks!

> I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents), fine-tune it on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
This is definitely something we're excited to look at in the coming months: exactly how much data is needed, differences between adding this in pretraining, midtraining, or post-training, etc. One reason it may be nice to add this in pretraining or midtraining rather than post-training is that you may want to save your last 100k steps of post-training for capabilities, since subsequent finetuning often degrades capabilities we care about (overall, the last steps of training seem to be fairly expensive real estate).

Additionally, starting post-training with good alignment priors, such that there is a "robust, stable basin of attraction", should be useful for avoiding the selection of misaligned personas that alignment-fake through training, and for avoiding making subsequent alignment training more difficult than it has to be.
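To make the "expensive real estate" point concrete, here is a minimal sketch; all stage names, step counts, and the 10k-step figure below are made-up illustrations, not numbers from our runs:

```python
# Hypothetical illustration of why adding alignment data in post-training is
# costly: every step spent on it there crowds out capabilities data.
# Stage names and step counts are invented for this sketch.

STAGE_STEPS = {
    "pretraining": 1_000_000,
    "midtraining": 100_000,
    "post_training": 100_000,  # the scarce budget we'd rather spend on capabilities
}

def capability_steps_remaining(placement: str, alignment_doc_steps: int = 10_000) -> int:
    """Post-training steps left for capabilities if alignment data is added at `placement`."""
    spent_in_post_training = alignment_doc_steps if placement == "post_training" else 0
    return STAGE_STEPS["post_training"] - spent_in_post_training

print(capability_steps_remaining("pretraining"))    # 100000: post-training budget untouched
print(capability_steps_remaining("post_training"))  # 90000: alignment data crowds out capabilities data
```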
> … so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I think the reason [special token] training underperforms our synthetic dataset is that the main positive dataset included in the paper was extremely information dense compared to our [special token] training, which took the form of long-form stories from hyperstition. The stories were beautiful, and I would even recommend reading some here, but since they're closer to novels than to dense blog posts or science articles, they ultimately contain maybe 1/20th of the direct descriptions of how the [special tokens] should behave.
We're planning to follow up with more work on deep character training to explore this difference directly.
> I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
We have some preliminary results here from a run we botched, which didn't make it into this version (though it seems like the positive synthetic data worked well even with the unfiltered pretraining model). I agree that this would be a clean comparison, and we hopefully will have updated results here in the new year.
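Concretely, the comparison being asked for can be sketched as filling in one cell of a 2x2 grid over the two interventions; the labels below are illustrative, not the post's official condition names:

```python
# Hypothetical 2x2 grid over the two interventions discussed in this thread
# (pretraining data filtering x synthetic alignment documents). Labels are ours.
conditions = {
    ("filtered pretraining", "synthetic alignment docs"): "combined condition reported in the post",
    ("filtered pretraining", "no synthetic docs"): "filtering-only baseline ('not SoTA on its own')",
    ("unfiltered pretraining", "synthetic alignment docs"): "the requested baseline (so far only a preliminary, botched run)",
    ("unfiltered pretraining", "no synthetic docs"): "regular pretrained model",
}

for (pretraining, docs), description in conditions.items():
    print(f"{pretraining} + {docs}: {description}")
```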