What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?
Doesn’t this ultimately result in the same competing objectives, plus the empirical problem of catastrophic forgetting?
Edit: Ah, but I see that you say at the end of the post “start with step 1, then add step 2 data while keeping step 1 data in the mix”. I don’t know, maybe this works to prevent forgetting, but my guess is that there are simply too many small, implicit rules that we currently train for during RLHF. It seems highly unlikely that you’d be able to cleanly decompose all of those human-preference constraints into a finite number of discrete stages without them clashing. But yeah, this is ultimately an empirical question.
Agreed that this is unclear. I think we should at least try. This (or a similar) procedure could lead to more control down the line, because you'd be building on top of a thing that understands the distinction between “this part sets up how you should behave” vs. “that part sets up what we are talking about right now”.
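To make the procedure concrete, here is a minimal sketch of one way to read “add step 2 data while keeping step 1 data in the mix”: replay a fraction of stage-1 examples inside every stage-2 batch. Everything here is illustrative and mine, not from the post (the toy tasks, the `mix_ratio`, the tiny linear model standing in for an LLM), and the right replay fraction is exactly the empirical question raised above.

```python
"""Toy sketch of staged training with replay: train on stage-1 data alone,
then train on stage-2 data while keeping stage-1 data in the mix.
All names and numbers are illustrative assumptions, not the post's method."""
import random
import torch
import torch.nn as nn

random.seed(0)
torch.manual_seed(0)

model = nn.Linear(4, 2)                 # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def toy_data(label, n=256):
    # Synthetic examples whose inputs cluster differently per "stage".
    x = torch.randn(n, 4) + (2.0 if label == 1 else -2.0)
    y = torch.full((n,), label, dtype=torch.long)
    return list(zip(x, y))

stage1 = toy_data(0)                    # "how you should behave"
stage2 = toy_data(1)                    # "what we are talking about right now"

def train(examples, steps=200, batch_size=32):
    for _ in range(steps):
        batch = random.sample(examples, batch_size)
        x = torch.stack([b[0] for b in batch])
        y = torch.stack([b[1] for b in batch])
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Stage 1: train on stage-1 data alone, with an unambiguous signal.
train(stage1)

# Stage 2: add stage-2 data while keeping stage-1 data in the mix.
# Here a fixed 30% of the stage-2 corpus is replayed stage-1 data;
# whether any fixed ratio actually prevents forgetting is the open
# empirical question from the comments above.
mix_ratio = 0.3
n_replay = int(len(stage2) * mix_ratio)
train(stage2 + random.sample(stage1, n_replay))
```

Note the design choice: mixing at the corpus level is the crudest version. A per-batch replay quota, or a curriculum over many small implicit rules rather than two clean stages, would be closer to what RLHF actually trains for, and is where the decomposition worry above would bite.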