Yeah, I think something like this could plausibly work, I’m just pretty unsure about how well it will work.
Alek Westover
Thanks, this is good to know. I’ll adjust my post.
Is your proposal that we have a different (possibly weaker / “trusted”) model write a reasoning trace which is part English and part image tokens and stuff, and then we “SFT” on that? I feel like the issue is that a model of a different family might not know the right concept space to write to.
In your comment you bring up the question “Is training the reasoning trace to look nice a bad idea?”
I want to defend some forms of training the reasoning trace to look nice.
For instance, I think it’d be totally reasonable to impose some structure on the reasoning such as:
Selecting for reasoning traces which appear to reason about the spec.
Selecting for reasoning traces that are human interpretable (i.e., penalizing the AI if it starts using words in weird ways).
I think I’m probably more into using SFT to shape reasoning than RL.
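As a toy sketch of what this kind of filtering-then-SFT could look like: keep only the traces that pass some crude checks, and use the survivors as SFT data. Everything here is a hypothetical stand-in — `mentions_spec` and `is_interpretable` would really be learned classifiers, and the word list is made up.

```python
# Hypothetical trace-filtering step before SFT. Both check functions are
# crude stand-ins for real classifiers.

COMMON_WORDS = {"the", "spec", "says", "user", "should", "so", "i", "will", "answer", "it"}

def mentions_spec(trace: str) -> bool:
    # Stand-in for "appears to reason about the spec".
    return "spec" in trace.lower()

def is_interpretable(trace: str) -> bool:
    # Stand-in for "uses words in normal ways": a crude check that most
    # tokens are ordinary words (penalizes weird word usage).
    words = trace.lower().split()
    if not words:
        return False
    ordinary = sum(w.strip(".,") in COMMON_WORDS for w in words)
    return ordinary / len(words) > 0.5

def filter_for_sft(traces: list[str]) -> list[str]:
    # Keep only traces passing both checks; these become the SFT dataset.
    return [t for t in traces if mentions_spec(t) and is_interpretable(t)]

traces = [
    "The spec says I should answer the user, so I will.",
    "zqx glorp spec vibes maximal shard",   # uninterpretable word salad
    "The user should answer, so I will.",   # never consults the spec
]
kept = filter_for_sft(traces)
```

Only the first trace survives both filters; the point is just that the selection pressure is applied by filtering data rather than by an RL reward.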
I agree that “train the AI to not mention reward hacking in its CoT” won’t work if you do it the most naive way, as OAI found https://openai.com/index/chain-of-thought-monitoring/. But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do atm), this could work.
williawa: I feel like a large part of the pitch for deliberative alignment is that we get to shape the model’s reasoning about the spec.
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
I was wrong about math problems—oops! We typically just supervise on the outputs for math. Thanks for catching this.
For the “reasoning about the spec” thing, I’m confused why the filtering step isn’t necessary.
Roger: I’m much more bearish on the encode/decode being possible. Sometimes I think in terms of images or video! Other times I have intuitions that I can’t put into words when pressed. Also, there might just not be human interpretable words for explaining what’s going on in the AI’s computation.
J Bostock: The technique you describe here seems like it’s very vulnerable to the decoder model colluding with the policy (and this seems super likely if they’re the same model!).
Something like this seems more promising if we:
1. Freeze the decoder / encoder.
2. During post-training, add a term for the reconstruction loss (“decode, paraphrase, then re-encode doesn’t totally wreck performance”).
But I’m skeptical of (2) working because maybe the encode / decode is just really hard to do well.
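As a toy illustration of what that reconstruction term could look like: add a penalty when the policy’s latent reasoning doesn’t survive the decode → paraphrase → re-encode round trip. The linear encoder/decoder and the noise-based “paraphrase” below are stand-in assumptions, not a real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy latent dimension

# Frozen decoder/encoder: toy linear maps between the policy's latent
# reasoning space and a "readable" space. In the actual proposal these
# would be trained models that get frozen; here they are fixed random
# matrices chosen to invert each other, purely for illustration.
W_dec = rng.normal(size=(d, d))
W_enc = np.linalg.inv(W_dec)

def decode(z):
    return z @ W_dec

def encode(x):
    return x @ W_enc

def paraphrase(x):
    # Stand-in for paraphrasing: a small perturbation of the readable form.
    return x + 0.01 * rng.normal(size=x.shape)

def task_loss(z, target):
    return float(np.mean((z - target) ** 2))

def total_loss(z, target, lam=1.0):
    # Penalize the policy when its reasoning doesn't survive the
    # decode -> paraphrase -> re-encode round trip.
    z_rt = encode(paraphrase(decode(z)))
    degradation = max(task_loss(z_rt, target) - task_loss(z, target), 0.0)
    return task_loss(z, target) + lam * degradation
```

The worry above is precisely that for real models, no frozen encoder/decoder pair achieves a small enough `degradation` term without wrecking performance.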
How will we do SFT on models with opaque reasoning?
Three visions for diffuse control
Online training makes info sec / siloing harder.
No one has done this. I think someone is working on it.
Hi Mike: Training for multiple epochs on the datapoints should totally be fair game, although I wouldn’t have guessed it to be necessary. I’ll try that.
Does the error bound for SGD assume a single pass?
I’m not quite sure how to apply the SGD bound here. The SGD bound is only about optimizing convex functions. The function we care about is

$$f(w) = \delta\big(\operatorname{sign}(w \cdot x),\, y\big)$$

($\cdot$ = dot product; $\delta$ = 1 if its arguments are the same, 0 otherwise), which is not convex.
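For intuition, the non-convexity is easy to check numerically: even on a single scalar datapoint, the 0-1 loss of a thresholded linear predictor violates Jensen’s inequality. The one-dimensional setup below is a toy assumption chosen for simplicity.

```python
def zero_one_loss(w: float, x: float, y: int) -> int:
    # Loss is 0 when sign(w . x) matches the label y, else 1
    # (i.e., 1 - delta(sign(w . x), y)).
    pred = 1 if w * x > 0 else -1
    return 0 if pred == y else 1

# Single datapoint x = 1 with label y = +1.
x, y = 1.0, 1
w1, w2 = -2.0, 2.0
mid = (w1 + w2) / 2  # = 0.0

lhs = zero_one_loss(mid, x, y)                                  # loss at the midpoint
rhs = (zero_one_loss(w1, x, y) + zero_one_loss(w2, x, y)) / 2   # average of endpoint losses
# Convexity would require lhs <= rhs, but here lhs = 1 > rhs = 0.5.
```

So the standard convex-optimization guarantees for SGD don’t directly apply, which is why I’m unsure how to use the bound here.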
Theoretical predictions on the sample efficiency of training policies and activation monitors
Another downside not explicitly mentioned here is catastrophic forgetting.
Three Upsides of Training Policies Online
In an earlier post I listed some potential downsides to online training of policies. In this post, I’ll list some reasons why we might want to train policies online anyways.
1: Detecting bad behavior might not suffice to elicit good behavior
If you just train monitors to detect bad behavior, and don’t train the policy not to do bad things, then you have three main options for what to do once you detect bad behavior:
A. Try to incriminate the model.
B. Resample from the model.
C. Use some other process to complete the task (e.g., ask a weaker model to do the task, try posing the task in a different way).
However:
(A) might be hard
(B) fails if the model is attacking consistently. Even if the model isn’t attacking super consistently, this resampling is still costly and/or increases latency.
(C) may produce a lower-quality answer.
So, if we want good behavior on the task, we might need to train the policy, and not just train monitors.
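A minimal sketch of the monitor-only workflow and its fallbacks — resampling (B) and handing off to a weaker model (C); incrimination (A) is outside the sketch. The policy, monitor, and weak model here are all hypothetical placeholders for real systems.

```python
import random

MAX_RESAMPLES = 3

def policy(task: str, seed: int) -> str:
    # Placeholder policy that attacks on some resamples.
    r = random.Random(f"{task}:{seed}")
    return "ATTACK" if r.random() < 0.5 else f"good answer to {task}"

def monitor_flags(action: str) -> bool:
    # Placeholder monitor: flags blatant attacks.
    return "ATTACK" in action

def weak_model(task: str) -> str:
    # Option (C): a trusted but weaker fallback.
    return f"mediocre answer to {task}"

def respond_to_flag(task: str) -> str:
    # Option (B): resample, hoping a non-attacking sample comes up...
    for seed in range(MAX_RESAMPLES):
        action = policy(task, seed)
        if not monitor_flags(action):
            return action
    # ...which is costly, and if the policy attacks consistently we are
    # forced down to option (C) and eat the quality hit.
    return weak_model(task)
```

The sketch makes the failure modes above concrete: (B) burns samples and latency, and when the policy attacks consistently we always end up at (C)’s weaker answer — whereas training the policy would shrink the attack rate at the source.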
2: Training policies might generalize
It might be the case that, while we have a hard time detecting sabotage (even after catching a couple examples of it), training the model not to sabotage on a couple examples generalizes to broadly removing the sabotage propensity.
3: Training the policy adds another layer to a defense in depth
Training a policy online should constrain it to have a relatively small attack rate. If your monitor’s failures are somewhat decorrelated from the times when the policy chooses to attack, then training both the policy and the monitor gives a system that lets through fewer attacks overall.
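Back-of-the-envelope version of the defense-in-depth point, with made-up numbers. The key assumption is that the policy’s attacks and the monitor’s misses are independent (i.e., decorrelated), so the failure rates multiply.

```python
# Made-up rates for illustration; the independence assumption is doing
# all the work here.
p_attack = 0.01   # policy attack rate after online training
m_miss = 0.05     # fraction of attacks the monitor fails to flag

# Rate of attacks that get through both layers, assuming independence.
undetected = p_attack * m_miss

# With only one layer of defense:
only_policy_trained = p_attack   # no monitor: every attack lands
only_monitor = 1.0 * m_miss      # untrained policy, assumed to always attack
```

Under these toy numbers the two-layer system lets through attacks at rate 0.0005, an order of magnitude or more below either layer alone; correlated failures would erode this multiplicative benefit.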
Vivek and I have found a few dimensions along which control situations frequently differ. The dimensions we usually think about are:
Is the AI sandbagging or performing sabotage? (Sandbagging is generally less harmful and less incriminating—it’s just the AI not trying very hard. Sabotage is the AI doing something that a benign model is very unlikely to do; it generally causes more damage, and is more likely to be incriminating).
Is there a small set of places where attacks by the AI are particularly leveraged, or is it equally harmful for it to attack at any time? (“Concentration of importance”).
Can we discriminate attacks from benign behavior, if we are able to look at them with expensive oversight (e.g., humans)? (Note that incrimination isn’t trivial even if attacks have high discriminability to a human, because it might be hard for weaker models to sift these out of the huge volume of actions that the AI model takes.)
Is shutdown an option?
These dimensions create several different regimes of control problems. Here are some regimes that we think are especially important:
“Uniform sandbagging” (studied by Emil here): sandbagging, uniform importance of tasks, no ability to incriminate the AI on attacks or to shut down.
“Diffuse control” (I think this is the central case in diffuse control; I’m open to changing my mind though): Low discriminability, moderate concentration of importance, no shutdown.
“High-stakes control” (see, e.g., CtrlZ): sabotage, high discriminability, shutdown is an option.
“Few-shot catastrophe prevention”: sabotage, high discriminability, shutdown is not an option.
I agree. I was mostly making the claim that on small time-scales (e.g., one hour at a whiteboard) it can make sense to have one G and one R (although you shouldn’t be too rigid about it imo).
@StanislavKrym: My impression is that pre-deployment evals currently take orgs like METR a few weeks. If pre-deployment evals continue to take about this long, then optimistically (this might be pretty expensive in practice) you’d be deploying models that haven’t been trained online for the past several weeks but have been audited. It’s unclear how big an issue this would be in practice.
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now. But it seems potentially good to do research on process-based supervision for reasoning.