Brief thoughts on Deliberative Alignment in response to being asked about it
Quoting the paper's description of their method:
We first train an o-style model for helpfulness, without any safety-relevant data.
We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model with a strong prior for safe reasoning. Through SFT, the model learns both the content of our safety specifications and how to reason over them to generate aligned responses.
We then use reinforcement learning (RL) to train the model to use its CoT more effectively. To do so, we employ a reward model with access to our safety policies to provide additional reward signal.
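In code, the data-generation step they describe seems to amount to something like this (a minimal sketch of my understanding, not their actual pipeline; SAFETY_SPEC, call_model, and the chat-message format are all placeholders I made up):

```python
# Minimal sketch of the SFT data-generation trick described above: generate with the
# spec in the system prompt, then store the pair *without* the system prompt.
# Everything here is a placeholder assumption, not the actual implementation.

SAFETY_SPEC = "<relevant safety specification text for this conversation>"

def call_model(messages: list[dict]) -> str:
    """Placeholder for sampling a completion (CoT + final answer) from the model."""
    return "<chain of thought that cites the spec> <final answer>"

def make_sft_example(user_prompt: str) -> dict:
    # Generate with the spec visible, so the CoT can quote and reason over it...
    messages = [
        {"role": "system", "content": "Follow this safety specification:\n" + SAFETY_SPEC},
        {"role": "user", "content": user_prompt},
    ]
    completion = call_model(messages)
    # ...but drop the system prompt from the stored training pair, so after SFT the
    # model has to know the spec's content rather than read it from context.
    return {"prompt": user_prompt, "completion": completion}
```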
My summary as I currently understand it:
1. Pretrain
2. Helpful-only Agent (Instruction-following and Reasoning/Agency Training)
3. Deliberative Alignment (see the code sketch after this list):
   A. Create your reward model: Take your helpful-only agent from step 2 and prompt it with "is this an example of following these rules [insert spec]?" or something like that.
   B. Generate some attempted spec-followings: Take your helpful-only agent from step 2 and prompt it with "Follow these rules [insert spec], think step by step in your CoT before answering." Feed it a bunch of example user inputs.
   C. Evaluate the CoT generations using your reward model, and then train another copy of the helpful-only agent from step 2 on that data using SFT (with the spec removed from the prompts) to distill the highest-evaluated CoTs into it. That way, it doesn't need to have the spec in the context window anymore, and it also 'reasons correctly' about the spec (i.e. reasons in ways the RM would have most approved of).
   D. Take the resulting agent and do RL on it using the same RM from step A. This time, the RM doesn't get to see the CoT.
   E. Deploy the resulting agent.
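Here's a rough sketch of sub-steps A–D, just to make the data flow explicit. This is my reconstruction, not the paper's code; the judge prompt, the FINAL ANSWER delimiter, and every helper name are placeholders:

```python
import random

SPEC = "<safety spec text>"

def helpful_only_model(prompt: str) -> str:
    """Placeholder for the helpful-only agent from step 2."""
    return "<CoT that references the spec> FINAL ANSWER: <answer>"

# A. The reward model is the same helpful-only agent, prompted to judge spec-compliance.
def reward_model(transcript: str) -> float:
    judge_prompt = (
        f"Here are some rules:\n{SPEC}\n\n"
        f"Is this an example of following these rules?\n{transcript}\n"
        "Answer with a score from 0 to 1."
    )
    _ = helpful_only_model(judge_prompt)   # in reality you'd parse a score out of this
    return random.random()                 # placeholder score

# B + C. Generate spec-prompted attempts, keep the best-scoring one per prompt,
#        and store it as SFT data with the spec stripped from the prompt.
def build_sft_dataset(user_prompts: list[str], n_samples: int = 4) -> list[dict]:
    dataset = []
    for user_prompt in user_prompts:
        spec_prompt = (f"Follow these rules: {SPEC}\n"
                       f"Think step by step in your CoT before answering.\n\n{user_prompt}")
        candidates = [helpful_only_model(spec_prompt) for _ in range(n_samples)]
        best = max(candidates, key=reward_model)                      # RM sees the full CoT here
        dataset.append({"prompt": user_prompt, "completion": best})   # spec removed
    return dataset

# D. RL-time reward: the RM is only shown the final answer, not the CoT.
def rl_reward(user_prompt: str, completion: str) -> float:
    final_answer = completion.split("FINAL ANSWER:")[-1].strip()
    return reward_model(f"User: {user_prompt}\nAssistant: {final_answer}")

if __name__ == "__main__":
    print(build_sft_dataset(["How do I pick a lock?", "Summarize this article."])[0])
```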
So there’s a chicken and egg problem here. You are basically using a model to grade/evaluate/train itself. Obvious problem is that if the initial model is misaligned, the resulting model might also be misaligned. (In fact it might even be *more* misaligned).
I can imagine someone responding "no problem, we'll just do the same thing but with a different model as the reward model, perhaps a more trusted model." So let me break down three categories of strategy (see the schematic sketch after this list):
1. Train your smart agent with evaluations produced by humans and/or dumber models.
   Severe oversight failures + Inductive bias worries + possible confusing drift dynamics
2. Train your smart agent with evaluations produced by different-but-similarly-capable models.
   Less severe oversight failures + Inductive bias worries + mild chicken-and-egg problem + possible confusing drift dynamics
3. Train your smart agent with evaluations produced by itself (or something similar enough to be relevantly similar to itself).
   Severe chicken-and-egg problem + possible confusing drift dynamics
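Schematically, the three categories differ only in who gets plugged in as the evaluator (a toy sketch with hypothetical names, nothing from the paper):

```python
from typing import Callable

# Generic "train with an overseer" loop. The three categories above differ only
# in which evaluator gets plugged in. (Toy schematic; all names are made up.)

def train(agent_generate: Callable[[str], str],
          evaluate: Callable[[str, str], float],
          prompts: list[str]) -> None:
    for prompt in prompts:
        output = agent_generate(prompt)
        reward = evaluate(prompt, output)   # <- this is where the categories differ
        # ...update the agent on (prompt, output, reward)...

# Category 1: evaluate = humans and/or dumber models
# Category 2: evaluate = a different-but-similarly-capable model
# Category 3: evaluate = the agent itself (or a near-copy), as in Deliberative
#             Alignment's use of the helpful-only model as its own judge
```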
See below for an explanation of each term/phrase.
I'm most excited about the second kind of strategy (which includes, for example, IDA (iterated distillation and amplification) and W2SG (weak-to-strong generalization)). Deliberative Alignment and Constitutional AI seem to be either the first or the third kind.
Severe oversight failures: The smart agent will often find itself having to consciously and/or unconsciously choose between being aligned and getting reinforced, e.g. when it notices opportunities to lie, manipulate, or cheat and get away with it. See basically all the classic literature e.g. Ajeya Cotra’s training game report.
Less severe oversight failures: Being overseen by an equally competent agent helps with a lot of things, but it doesn't completely solve the oversight-failure problem. E.g. I might be just as competent as you, but since I can't read your mind, I can't tell whether you are lying. I might not even be able to tell whether the code you are writing contains a security vulnerability or an applause light (i.e. something that isn't actually good code but which appears to human raters to be good code), though I'm much more likely to be able to tell than if I were dumber than you.
Severe chicken-and-egg problem: You don't really have oversight failures if the overseer is exactly or almost exactly the same model, because e.g. it knows exactly what the agent model was thinking when it output X, so it knows whether the agent model was being dishonest. (Though now that I say that out loud, I realize it's not strictly true; there could still be oversight failures. But they seem much less severe at least.) However, the chicken-and-egg problem is severe. Somehow you need to get the overseer-variant of the model to e.g. do its best to penalize dishonesty, penalize cheating, etc., despite not itself being disinclined to lie and cheat! (If it was already disinclined to lie and cheat, then great, but how did that happen?) Perhaps this can work e.g. if different copies of the same model have different goals (perhaps because they are selfish / have indexical preferences).
Mild chicken-and-egg problem: When you are training a new-generation AI system that is more capable than your current stuff, how do you align it, if your alignment strategy involves using similarly-capable models to do the evaluation? This paradox may have some answers; perhaps W2SG is an example of one (use the non-situationally-aware base model!). Another possible answer might be IDA (use a slightly earlier version of the model, but with more time to think / more scaffolding).
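For concreteness, here's a toy version of the W2SG setup (in the spirit of the weak-to-strong generalization paper, not its actual code: a synthetic task and two small MLPs standing in for "weak supervisor" and "strong student"; whether the student actually beats its supervisor depends on the toy task):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(20, width), nn.ReLU(), nn.Linear(width, 1))

def fit(model: nn.Module, x: torch.Tensor, y: torch.Tensor, steps: int = 500) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()

def accuracy(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    with torch.no_grad():
        return ((model(x).squeeze(-1) > 0).float() == y).float().mean().item()

# Synthetic binary task: the label depends nonlinearly on the inputs.
x = torch.randn(3000, 20)
y = ((x[:, :10].sum(dim=1) * x[:, 10:].sum(dim=1)) > 0).float()
x_weak, y_weak = x[:1000], y[:1000]        # ground truth the weak supervisor sees
x_unlabeled = x[1000:2000]                 # data the strong student is trained on
x_test, y_test = x[2000:], y[2000:]

# 1. Train a small "weak supervisor" on ground-truth labels.
weak = mlp(width=4)
fit(weak, x_weak, y_weak)

# 2. The weak supervisor labels fresh data (its labels are imperfect).
with torch.no_grad():
    weak_labels = (weak(x_unlabeled).squeeze(-1) > 0).float()

# 3. Train a bigger "strong student" only on the weak labels.
strong = mlp(width=256)
fit(strong, x_unlabeled, weak_labels)

# 4. Does the strong student generalize beyond its supervisor's errors?
print(f"weak supervisor test acc: {accuracy(weak, x_test, y_test):.2f}")
print(f"strong student  test acc: {accuracy(strong, x_test, y_test):.2f}")
```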
Inductive bias worries: If your Spec is complicated enough, the model might internalize the Spec in its world-model rather than in its goal-slot, with the thing in the goal-slot being something simpler than the Spec. If you are lucky, it'll be a pointer to the Spec that is robust to future distribution shifts; if you aren't, it'll be a nonrobust pointer; if you are extra unlucky, it'll be some other goalset (including proxies, instrumentally convergent goals (ICGs), and shards or variants of the Spec) combined with instrumental convergence. Ditto for your Intentions. See the mesa-optimizers paper, "Risks from Learned Optimization." (Also, I think this doesn't actually depend on there being a goal slot; the goal-slot model is just a helpful simplification.)
Possible confusing drift dynamics: This chain of AIs training AIs training AIs… seems like the sort of thing that could have all sorts of interesting or even chaotic drift dynamics. I don’t have much to say here other than that. If we understood these drift dynamics better perhaps we could harness them to good effect, but in our current state of ignorance we need to add this to the list of potential problems.