In your comment you bring up the question “Is training the reasoning trace to look nice a bad idea?”
I want to defend some forms of training the reasoning trace to look nice.
For instance, I think it’d be totally reasonable to impose some structure on the reasoning such as:
Selecting for reasoning traces which appear to reason about the spec.
Selecting for reasoning traces that are human interpretable (i.e., penalizing the AI if it starts using words in weird ways).
I think I’m probably more into using SFT to shape reasoning than RL.
I agree that “train the AI to not mention reward hacking in its CoT” won’t work if you do it the most naive way, as OAI found (https://openai.com/index/chain-of-thought-monitoring/). But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do atm), this could work.
I agree there are reasons to do it. But to be honest, I disagree with the specific examples you gave.
My pitch for doing training on reasoning traces is this: Aligning deep learning systems is hard because we train on outputs, and therefore can’t know the process that produced those outputs. But CoT is an exception to this. The CoT seems in some sense load-bearing, which means that intervening on it might allow us to change a model’s behavior by changing why it does certain things.
But I think this is very risky and has to be used with a great degree of care. I’d prefer only doing it for stuff that really matters, like alignment.
In general I think training on reasoning traces should mostly be done along these lines:
Better to do it early in training, when the model is dumber and less situationally aware
Better to do it “softly”, ie, RL / rejection sampling (which can be thought of as a crappy version of RL), where you’re mostly training the model to display more of a behavior it already sometimes displays anyway
(rather than eg, gathering (query, CoT, answer) tuples, then rewriting the CoT to CoT* and training on (query, CoT*, answer))
Better to do it for important things, ie, using it to instill aligned goals in the first place, rather than making the CoT look more legible or making it reason about a spec exactly the way you want
Better to do it a little than to do it a lot, and better to do it in limited domains rather than doing it in a variety of situations
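To make the “soft” option concrete, here is a minimal sketch of rejection sampling over reasoning traces: sample several (query, CoT, answer) tuples, keep only those whose CoT already shows the behavior you want, and train on the survivors unmodified. Everything here is hypothetical and for illustration only: `Sample`, `rejection_sample`, `make_toy_sampler`, and `mentions_spec` are made-up names, the “model” is a toy that cycles through canned traces, and the keyword filter stands in for whatever real selection criterion you would use.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One (query, CoT, answer) tuple produced by the model."""
    query: str
    cot: str
    answer: str


def rejection_sample(
    query: str,
    sample_fn: Callable[[str], Sample],
    accept_fn: Callable[[Sample], bool],
    n: int,
) -> List[Sample]:
    """Draw n samples and keep those whose CoT passes the filter.

    The kept CoTs are the model's own text, never rewritten, so training
    on them only upweights behavior the model already sometimes displays.
    """
    candidates = [sample_fn(query) for _ in range(n)]
    return [s for s in candidates if accept_fn(s)]


def make_toy_sampler(cots: List[str]) -> Callable[[str], Sample]:
    """Hypothetical stand-in for a model: cycles through canned CoTs."""
    pool = itertools.cycle(cots)

    def sample_fn(query: str) -> Sample:
        return Sample(query=query, cot=next(pool), answer="42")

    return sample_fn


def mentions_spec(sample: Sample) -> bool:
    """Toy filter (assumption): accept traces that mention the spec."""
    return "spec" in sample.cot.lower()


TOY_COTS = [
    "The spec says to refuse here, so I should refuse.",
    "Just guess the answer.",
    "Checking the spec: this request is allowed.",
]

sampler = make_toy_sampler(TOY_COTS)
kept = rejection_sample("toy query", sampler, mentions_spec, n=6)
for s in kept:
    print(s.cot)
```

The contrast with the rewriting approach is that nothing in `kept` was authored by anyone but the model; the CoT* pipeline would instead edit each trace before training on it.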
But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do atm), this could work.
I mean, if we had a way to guarantee that the CoT remained faithful, even as we do random tampering with it, that would be great news. We could just train the CoT to look nice and aligned. And we’d have solved at least half of alignment that way.
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now.
But it seems potentially good to do research on process-based supervision for reasoning.