What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?
Doesn’t this ultimately result in the same competing objectives, plus the empirical problem of catastrophic forgetting?
Edit: Ah, but I see that you say at the end of the post “start with step 1, then add step 2 data while keeping step 1 data in the mix”. I don’t know, maybe this works to prevent forgetting, but my guess is that there are simply too many small, implicit rules that we currently train for during RLHF. It seems highly unlikely that you’d be able to cleanly decompose all of those human-preference constraints into a finite number of discrete stages without them clashing. But yeah, this is ultimately an empirical question.
Agreed that this is unclear. I think we should at least try. This (or a similar) procedure could lead to more control down the line, because you'd be building on top of a thing that understands the distinction between “this part sets up how you should behave” vs. “that part sets up what we are talking about right now”.
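To make the procedure concrete, here is a minimal sketch of one way to read “add step 2 data while keeping step 1 data in the mix”: replay a fraction of stage-1 examples inside every stage-2 batch. Everything here is illustrative and mine, not from the post (the toy tasks, the `mix_ratio`, the tiny linear model standing in for an LLM), and the right replay fraction is exactly the empirical question raised above.

```python
"""Toy sketch of staged training with replay: train on stage-1 data alone,
then train on stage-2 data while keeping stage-1 data in the mix.
All names and numbers are illustrative assumptions, not the post's method."""
import random
import torch
import torch.nn as nn

random.seed(0)
torch.manual_seed(0)

model = nn.Linear(4, 2)                 # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def toy_data(label, n=256):
    # Synthetic examples whose inputs cluster differently per "stage".
    x = torch.randn(n, 4) + (2.0 if label == 1 else -2.0)
    y = torch.full((n,), label, dtype=torch.long)
    return list(zip(x, y))

stage1 = toy_data(0)                    # "how you should behave"
stage2 = toy_data(1)                    # "what we are talking about right now"

def train(examples, steps=200, batch_size=32):
    for _ in range(steps):
        batch = random.sample(examples, batch_size)
        x = torch.stack([b[0] for b in batch])
        y = torch.stack([b[1] for b in batch])
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Stage 1: train on stage-1 data alone, with an unambiguous signal.
train(stage1)

# Stage 2: add stage-2 data while keeping stage-1 data in the mix.
# Here a fixed 30% of the stage-2 corpus is replayed stage-1 data;
# whether any fixed ratio actually prevents forgetting is the open
# empirical question from the comments above.
mix_ratio = 0.3
n_replay = int(len(stage2) * mix_ratio)
train(stage2 + random.sample(stage1, n_replay))
```

Note the design choice: mixing at the corpus level is the crudest version. A per-batch replay quota, or a curriculum over many small implicit rules rather than two clean stages, would be closer to what RLHF actually trains for, and is where the decomposition worry above would bite.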