Thanks Cam!
I agree that this could be a nice testbed for assessing the generalisation ability of alignment techniques, and that it would be really interesting to test SDF and why/stories/high-quality SFT on this dataset, since these seem to induce better generalisation. The inspiration for this post was actually the intuition that DPO/SFT-based techniques are brittle in some way that SDF isn't, so the natural next step after examining the former is to test the latter!
I will note, however, that I am unsure how well these would work on such small (<10B) models.