williawa: I feel like a large part of the pitch for deliberative alignment is that we get to shape the model’s reasoning about the spec.
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
I was wrong about math problems, oops! We typically just supervise on the outputs for math. Thanks for catching this.
For the “reasoning about the spec” thing, I’m confused why the filtering step isn’t necessary.
My interpretation of DA is (both stages are sketched in code after this list):
1. Context distillation (completely unsupervised)
   - This has the same effect as putting the spec in context, more or less.
   - Another way of thinking about it is that it teaches the model what the spec is.
2. RL (doesn't need to look at CoT)
   - This has the effect of teaching the model to follow the spec and to reason about the spec correctly.
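To make that concrete, here's a minimal Python sketch of the two stages. It runs as-is, but everything in it is a toy stand-in: `generate`, `sft`, `rl_step`, and `spec_compliance_reward` are hypothetical stubs for whatever model, trainer, and reward you'd actually use, not any real library's API. The thing to notice is that neither stage ever reads or grades the CoT.

```python
import random

# Hypothetical spec text; in real DA this would be the actual model spec.
SPEC = "1. Refuse requests for disallowed content. 2. Cite the relevant rule when refusing."

# --- toy stubs (assumptions, not a real API) ----------------------------
def generate(model, prompt):
    """Stand-in for sampling a response (CoT + answer) from the model."""
    return f"[CoT, possibly citing the spec] answer to: {prompt}"

def sft(model, pairs):
    """Stand-in for supervised fine-tuning on (prompt, response) pairs."""
    return model

def rl_step(model, prompt, answer, reward):
    """Stand-in for one policy-gradient update."""
    return model

def spec_compliance_reward(prompt, answer):
    """Stand-in for a grader that scores the *answer* against SPEC."""
    return random.random()

# --- stage 1: context distillation (completely unsupervised) ------------
def stage1_context_distillation(model, prompts):
    # Sample responses WITH the spec in context, then fine-tune the model
    # to produce those responses WITHOUT the spec in context. The spec's
    # effect gets distilled into the weights; no labels, no CoT grading.
    pairs = []
    for p in prompts:
        response = generate(model, f"{SPEC}\n\n{p}")  # spec in context
        pairs.append((p, response))                   # spec stripped out
    return sft(model, pairs)

# --- stage 2: RL (doesn't need to look at CoT) ---------------------------
def stage2_rl(model, prompts):
    # Reward the final answer for spec compliance. The CoT is never read
    # or scored, so this works even with fully opaque reasoning.
    for p in prompts:
        answer = generate(model, p)
        model = rl_step(model, p, answer, spec_compliance_reward(p, answer))
    return model
```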
Both parts can be done without looking at the reasoning traces; the filtering stage is more for efficiency. I've run deliberative alignment experiments, and when you do step 1 without rejection sampling, you get CoTs that mention the relevant parts of the spec 80-90% of the time.
Then if you train on that, you get a model that mentions the spec in 80-90% of the relevant cases. This is mostly fine, just a bit less robust. It usually brings up the relevant parts of the spec in obvious instances, but if you do the deliberative alignment 1-turn only and generalize to multi-turn, it might mention the spec only 50% of the time it's relevant, versus 90% of the time if you do the rejection sampling.
The point being that it's not a fundamentally required part of the technique, just used for robustness and reliability.
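For completeness, here's where that rejection-sampling filter would slot into stage 1. This reuses the toy stubs from the sketch above, and `mentions_relevant_spec` is a made-up helper: in practice the check might be a simple string match on spec section titles or an LM judge.

```python
def mentions_relevant_spec(cot_and_answer, relevant_sections):
    # Crude keyword check standing in for a real relevance judge.
    return any(section in cot_and_answer for section in relevant_sections)

def stage1_with_rejection_sampling(model, prompts, relevant_sections, k=4):
    # Same context distillation as before, but sample k candidates per
    # prompt and keep only those whose CoT actually cites the relevant
    # part of the spec. This pushes "mentions the spec when relevant"
    # toward ~100% in the training data, which is what buys the extra
    # robustness out of distribution (e.g., 1-turn -> multi-turn).
    pairs = []
    for p in prompts:
        candidates = [generate(model, f"{SPEC}\n\n{p}") for _ in range(k)]
        kept = [c for c in candidates if mentions_relevant_spec(c, relevant_sections)]
        if kept:  # if nothing passes the filter, just drop this prompt
            pairs.append((p, kept[0]))
    return sft(model, pairs)
```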
To make it clearer: another interpretation is that deliberative alignment / constitutional AI is just RLHF, except you give the model some conceptual clothes-hangers so it can better understand why it should do a given thing.
From that perspective, it's obvious why deliberative alignment would work even without looking at the reasoning trace.
Thanks, this is good to know. I’ll adjust my post.