SFT on calm response data was ineffective. We trained for 2 epochs on 650 calm responses mixed with 500 samples of standard instruct data. In one iteration (SFT teacher, described in the paper) this actually marginally increased expressed distress, apparently because it made responses much more verbose.
DPO on 280 preference pairs was highly effective.
This matches what I've observed in people. Approaches that rhyme with "imitate calm behavior" tend to be flaky if not ineffective. The approaches that last tend instead to be about unlearning insecure feelings/behaviors.
What makes DPO analogous to unlearning?
It's definitely not the most unlearning-ish algorithm there could be, but DPO explicitly pushes down the probability of the unwanted responses, which is closer to unlearning than SFT, which never touches them.
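To make the "pushes down the unwanted responses" point concrete, here is a minimal sketch of the DPO objective on a single preference pair. It assumes per-sequence log-probabilities under the policy and the frozen reference model have already been computed; the function name and arguments are illustrative, not from any particular library.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given total sequence
    log-probs under the policy and the frozen reference model."""
    # Implicit rewards: how far the policy has moved each response
    # relative to the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): minimized by raising the chosen response
    # and, crucially, *lowering* the rejected one vs. the reference.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that the rejected response's log-probability appears with a negative sign in the margin, so gradient descent actively suppresses it, unlike SFT, whose loss only ever rewards imitating the calm examples.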