Hi, thanks for your comment!
You are right to flag this. I think the hypothesis I was testing (character struggles to generalise OOD) is supported by this set of experiments, since both the email body and the agent scaffolding are OOD, but the experiments do not isolate the precise cause. I ran an additional experiment to address this:
The model is given the same system prompt with the agentic scaffolding ablated, and is instructed to draft the email (rather than send it, as before):
It seems that most of the gap comes from agentic scaffolding, and some comes from the email body. This is consistent with both being OOD elements that would reduce character/trait presence.
Thanks Cam!
I agree that this could be a nice testbed for assessing how well alignment techniques generalise, and that it would be really interesting to test the SDF on why/stories/high-quality SFT on this dataset, which seem to induce better generalisation. The inspiration for this post was actually the intuition that DPO/SFT-based techniques are brittle in a way that SDF isn’t, so the natural next step after seeing the former is to test the latter!
I will note, however, that I am unsure how well these would work on such small (<10B) models.