SFT on calm response data was ineffective. We trained for 2 epochs on 650 calm responses mixed with 500 samples of standard instruct data. In one iteration (SFT teacher, described in the paper) this actually marginally increased expressed distress, apparently because it made responses much more verbose.
DPO on 280 preference pairs was highly effective.
This matches what I've observed in people. Approaches that rhyme with "imitate calm behavior" tend to be flaky if not ineffective. The approaches that last tend instead to be about unlearning insecure feelings/behaviors.
What makes DPO analogous to unlearning?
It's definitely not the most unlearning-ish algorithm there could be, but DPO explicitly pushes down the probability of the unwanted responses, which is closer to unlearning than SFT, which never touches them.
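To make the "pushes down the unwanted responses" point concrete, here is a minimal sketch of the DPO objective on a single preference pair. It assumes per-sequence log-probabilities under the policy and the frozen reference model have already been computed; the function name and arguments are illustrative, not from any particular library.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given total sequence
    log-probs under the policy and the frozen reference model."""
    # Implicit rewards: how far the policy has moved each response
    # relative to the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): minimized by raising the chosen response
    # and, crucially, *lowering* the rejected one vs. the reference.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that the rejected response's log-probability appears with a negative sign in the margin, so gradient descent actively suppresses it, unlike SFT, whose loss only ever rewards imitating the calm examples.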