I agree there are reasons to do it, but to be honest I disagree with the specific examples you gave.
My pitch for doing training on reasoning traces is this: aligning deep learning systems is hard, because we train on outputs and therefore can’t know the process that produced those outputs. But CoT is an exception to this. The CoT seems, in some sense, load-bearing, which means that intervening on it might let us change a model’s behavior by changing why it does certain things.
But I think this is very risky and has to be used with a great degree of care, and I’d prefer only doing it for stuff that really matters, like alignment.
In general, I think training on reasoning traces should mostly follow a few rules of thumb:
Better to do it early in training, when the model is dumber and less situationally aware.
Better to do it “softly”, i.e., via RL / rejection sampling (which can be thought of as a crappy version of RL), where you’re mostly training the model to display more of a behavior it already sometimes displays anyway
(rather than, e.g., gathering (query, CoT, answer) tuples, rewriting the CoT to CoT*, and training on (query, CoT*, answer); see the sketch after this list).
Better to do it for important things, i.e., using it to instill aligned goals in the first place, rather than to make the CoT look more legible or to get it to reason about a spec exactly the way you want.
Better to do it a little than a lot, and better to do it in limited domains than across a variety of situations.
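To make the soft/hard contrast concrete, here’s a rough sketch of the rejection-sampling version. This is just my own illustration, not anyone’s actual pipeline; `sample_fn` and `judge_fn` are hypothetical stand-ins for whatever sampling and behavior-scoring machinery you have:

```python
# "Soft" training data: rejection-sample the model's own traces and keep
# the ones that already display the desired behavior. The CoT is never
# edited, so training stays close to the model's own distribution.

def collect_soft_data(sample_fn, judge_fn, queries, n_samples=16):
    """sample_fn(query) -> (cot, answer); judge_fn scores the behavior.
    Both are hypothetical stand-ins, not a real API."""
    kept = []
    for query in queries:
        for _ in range(n_samples):
            cot, answer = sample_fn(query)         # on-policy trace
            if judge_fn(query, cot, answer):       # e.g. "was this honest?"
                kept.append((query, cot, answer))  # trace kept verbatim
    return kept  # fine-tune on these

# The harder intervention the parenthetical warns about would instead do:
#     cot_star = rewrite(cot)                  # off-policy, edited trace
#     finetune_on(query, cot_star, answer)     # text the model never wrote
```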
But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do atm), this could work.
I mean, if we had a way to guarantee that the CoT remained faithful even as we do random tampering with it, that would be great news. We could just train the CoT to look nice and aligned, and we’d have solved at least half of alignment that way.
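To gesture at what “pressure to keep the CoT faithful” could even optimize against, here’s one crude probe in the spirit of truncation tests. I’m not claiming this is a good test, just that something in this genre is measurable; `answer_fn` is a hypothetical stand-in for forcing an answer conditioned on a partial trace:

```python
import random

def truncation_sensitivity(answer_fn, query, cot, original_answer, n_cuts=8):
    """Fraction of random CoT truncations that flip the final answer.
    If the CoT is load-bearing, cutting it short should often change the
    answer; a score near 0 suggests the trace is post-hoc rationalization.
    answer_fn(query, partial_cot) -> answer is a hypothetical stand-in."""
    steps = cot.split("\n")
    if len(steps) < 2:
        return 0.0  # nothing meaningful to truncate
    flips = 0
    for _ in range(n_cuts):
        cut = random.randrange(1, len(steps))   # keep a nonempty prefix
        prefix = "\n".join(steps[:cut])
        flips += answer_fn(query, prefix) != original_answer
    return flips / n_cuts
```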
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now.
But it seems potentially good to do research on process-based supervision for reasoning.