The LW “import from google doc” feature should only import one tab of a document.
Alek Westover
Advice for making robust-to-training model organisms
Why does off-model SFT degrade capabilities?
Incriminating misaligned AI models via distillation
Sleeper Agent Backdoor Results Are Messy
Actually there are probably 4 types of auditing. A 4th type is where you do auditing to detect the issue, and then try to patch it via outer loop methods rather than inner loop methods.
An Empirical Study of Methods for SFTing Opaque Reasoning Models
How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
Three reasons to do auditing in diffuse control
There are three reasons you might want to do auditing in diffuse control:
Strong incrimination: gather compelling evidence that the AI is egregiously misaligned, in hopes that this catalyzes a drastic change in the AI company’s orientation towards misalignment risk (e.g., move from plan D to plan C).
Weak incrimination: gather evidence that the AI is misaligned in some problematic way, in hopes of motivating the lab to do something moderately costly (e.g., adopt a certain annoying training measure in an attempt to control the misaligned AI).
Obtaining a training signal that can be used to cause the AI to do less of the undesirable action in the future (and possibly also generalize to removing the misalignment that caused the bad action).
The auditing that you want to do in pursuit of these objectives is somewhat different in nature.
Strong incrimination seems pretty unlikely to be viable, and I won’t comment more on it.
For weakly incriminating the AI, I think the most promising approach is to look for:
Strong concerning statistical trends.
You could try to come up with a way of grading the model’s research decisions in projects, and show that the decisions are much lower quality during safety research.
Show that the AI has below trend capabilities in areas related to safety research despite having elevated capabilities in lots of other domains.
Clean ablations indicating that the model is differentially sandbagging on safety research.
For instance, if you had a bunch of instances where framing a task as safety research vs capabilities research changed the AI’s performance on the task, that might be compelling.
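As a rough illustration of the framing ablation above, here is a minimal sketch of a paired sign-flip permutation test. It assumes you already have a numeric quality score for each task under both framings; the function name, the scores, and the choice of a one-sided test are all hypothetical choices made for this example, not something from an existing pipeline.

```python
import random


def paired_framing_test(safety_scores, capabilities_scores,
                        n_permutations=10_000, seed=0):
    """One-sided sign-flip permutation test on paired task scores.

    Index i in both lists is the same underlying task, scored once under a
    "safety research" framing and once under a "capabilities research"
    framing. Returns (mean_gap, p_value); mean_gap > 0 means the model
    scored lower under the safety framing.
    """
    if len(safety_scores) != len(capabilities_scores):
        raise ValueError("scores must be paired task-by-task")
    diffs = [c - s for s, c in zip(safety_scores, capabilities_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the framing is irrelevant, so each
        # paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value
```

A small p-value only says that the framing changes performance. It does not by itself separate sandbagging from more benign explanations, such as the safety framing pushing the task off-distribution.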
It’s tempting to imagine that we’ll be able to distinguish between:
Benign mistakes
“Reward hacking”
Sabotage motivated by egregious misalignment.
However, my current best guess is that this is basically going to be impossible. Thus, I think that trying to prove that an error in a research project was intentional is not useful for weak incrimination.
Here are some ideas on how to obtain training signal that you’ll use to remove misaligned behavior:
Sample random projects, weighted towards more influential / important projects. Replicate these projects with trusted + human labor (at least partially) to see if the overall result is correct.
For the projects that don’t replicate, try to pinpoint where the problem is and find a proposed patch.
It’s not crucial that we pinpoint where the problem is, but pointing to it more surgically seems useful, because it lets us target the behavior more aggressively.
It also isn’t too bad here to have false positives, and it’s not an issue that we likely can’t distinguish between benign mistakes, reward hacking and motivated sabotage. However, decreasing false positives and more precisely targeting actions downstream of misalignment would be better.
In addition to auditing projects that the AI did, we can audit “sabotage honeypots”: research problems with a particularly easy-to-think-of sabotage, where we check whether the AI does this sabotage. You’ll know if the AI messed up because you already have the ground truth, and it might be easy to isolate the place where it messed up because you know the intended sabotage.
One final way that we could obtain a training signal is by generating somewhat off-policy sabotage attempts. For instance, we could prompt / train a copy of the AI to sabotage a research project and then train the AI against this off-policy sabotage data. (Note, however, that this off-policy data might be much less effective than on-policy data.)
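The first idea above (sampling projects for replication, weighted towards important ones) can be sketched as weighted sampling without replacement. This is a minimal sketch that assumes each project already has a nonnegative importance weight; the function name and the weights are hypothetical, and the key trick used here is the Efraimidis-Spirakis method rather than anything specified in this post.

```python
import random


def sample_projects_for_audit(importance_by_project, k, seed=None):
    """Pick up to k distinct projects, favoring higher importance weights.

    Each project gets the key u ** (1 / weight) with u uniform in (0, 1],
    and the k projects with the largest keys are kept.
    """
    rng = random.Random(seed)
    keyed = []
    for project, weight in importance_by_project.items():
        if weight <= 0:
            continue  # in this sketch, zero-weight projects are never audited
        u = 1.0 - rng.random()  # lies in (0, 1], so u is never exactly 0
        keyed.append((u ** (1.0 / weight), project))
    keyed.sort(reverse=True)
    return [project for _, project in keyed[:k]]
```

Keeping the sampling random, rather than always auditing the top-k most important projects, means the AI cannot be sure that any particular project will go unaudited.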
Note that once we obtain training signal it’s unclear whether it’d be best to train the AI directly against this sabotage, or to just train automated “sabotage” detectors, and then rejection sample against these to get good research. It could be better to just train better detectors, because maybe caught research errors are not very costly, whereas making the AI generally more sneaky (as a consequence of training against times when we catch it doing bad things) is very bad. However, it’s also plausible that the generalization from training against sabotage is quite favorable and desirable.
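To make the second option concrete, here is a minimal sketch of rejection sampling against a detector. The callables `generate` and `sabotage_score`, the threshold, and the attempt limit are all hypothetical stand-ins: in practice `generate` would sample a research output from the AI and `sabotage_score` would be a detector trained on the audit data.

```python
from typing import Callable, Optional


def rejection_sample(
    generate: Callable[[], str],
    sabotage_score: Callable[[str], float],
    threshold: float,
    max_attempts: int = 8,
) -> Optional[str]:
    """Return the first candidate the detector scores below `threshold`.

    Returns None if every attempt is flagged, which is a natural point to
    escalate to human audit instead of silently accepting flagged work.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if sabotage_score(candidate) < threshold:
            return candidate
    return None
```

In this setup the policy itself never receives a gradient from the caught examples. Only the detector is trained on them, which is the property that might avoid teaching the AI to be sneakier.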
Summary
I think it’s helpful to keep in mind that incrimination probably relies on finding broad concerning trends in AI behavior that aren’t well explained as reward hacking or incompetence, whereas obtaining a good training signal for removing sabotage looks more like finding a lot of particular examples of bad behavior, where “false positives” are not highly problematic.
Five approaches to evaluating training-based control measures
Model organisms researchers should check whether high LRs defeat their model organisms
Responding to your point about low-stakes control:
I weakly predict that sandbagging at ~human level (and even a bit after this) will be hard for AI models. Here are some reasons why I’m optimistic about this:
Sandbagging to worse than human level seems really hard:
There are a bunch of debate-flavored techniques that could let us do self-supervision.
You might be worried about the AIs colluding if we do self-supervision, but training can incentivize the AIs to defect against each other, and collusion might be hard, especially if we have some information shielding.
Weak-to-strong generalization should mean that the actual performance of the AI is better than our reward signals, because there should be some errors in our reward signal that are hard for the AI to model.
One reason that exploration hacking might be hard is that many tasks exist on a continuum (h/t Vivek). Thus, there will probably be some task-difficulty level where the AI is unsure whether or not it should exploration hack. If it sandbags too aggressively, training will remove the sandbagging. If it doesn’t sandbag aggressively enough, then we get the performance we wanted in the first place. But if there’s some region in task-difficulty space where the AI is not consistently sandbagging, then RL should just work in this region, and this should push up the frontier of problem difficulty where the AI is unsure whether it should sandbag. Then you could iterate. This could solve exploration hacking (but not the fact that our reward signals are bad).
The problems with off-policy data could possibly be mitigated in several ways:
We could do behavioral cloning with on-policy relabeling (e.g., DAgger).
You could use a reference solution to guide or grade an AI’s solution.
We could generate a bunch of human training data on doing alignment—for instance, we could ask Anthropic alignment science researchers to record themselves doing alignment research.
Plausibly the fact that sandbagging at human level is hard is sufficient: hopefully getting to handoff doesn’t require paradigm-shifting research. If it does, we might be in more trouble.
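The continuum argument about exploration hacking above can be illustrated with a deliberately crude toy model. Everything here is an assumption made for illustration: difficulty is a single number, the AI has a fixed-width band just above the current frontier where it sandbags inconsistently, and one round of RL converts that whole band into full-effort behavior. The function and parameter names are hypothetical.

```python
def iterate_frontier(frontier, band_width, true_ceiling, rounds):
    """Track the difficulty below which the AI no longer sandbags.

    frontier:     tasks at or below this difficulty are done at full effort.
    band_width:   width of the region just above the frontier where the AI
                  sandbags inconsistently, so RL sees some high-reward samples.
    true_ceiling: the hardest task difficulty the AI can actually handle.
    """
    history = [frontier]
    for _ in range(rounds):
        # RL reinforces the full-effort samples inside the unsure band, so
        # the band becomes full effort and the unsure region moves upward.
        frontier = min(frontier + band_width, true_ceiling)
        history.append(frontier)
    return history
```

For example, `iterate_frontier(0.25, 0.25, 1.0, 4)` climbs to the ceiling, while a `band_width` of 0 never moves. In this toy picture the whole argument rests on the AI being unable to sandbag perfectly consistently at the boundary.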
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now. But it seems potentially good to do research on process-based supervision for reasoning.
Yeah, I think something like this could plausibly work, I’m just pretty unsure about how well it will work.
Thanks, this is good to know. I’ll adjust my post.
Is your proposal that we have a different (possibly weaker / “trusted”) model write a reasoning trace which is part English and part image tokens and stuff, and then we “SFT” on that? I feel like the issue is that a model of a different family might not know the right concept space to write to.
In your comment you bring up the question “Is training the reasoning trace to look nice a bad idea?”
I want to defend some forms of training the reasoning trace to look nice.
For instance, I think it’d be totally reasonable to impose some structure on the reasoning such as:
Selecting for reasoning traces which appear to reason about the spec.
Selecting for reasoning traces that are human interpretable (i.e., penalizing the AI if it starts using words in weird ways).
I think I’m probably more into using SFT to shape reasoning than RL.
I agree that “train the AI to not mention reward hacking in its CoT” won’t work if you do it the most naive way, as OAI found (https://openai.com/index/chain-of-thought-monitoring/). But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do atm), this could work.
williawa: I feel like a large part of the pitch for deliberative alignment is that we get to shape the model’s reasoning about the spec.
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
I was wrong about math problems—oops! We typically just supervise on the outputs for math. Thanks for catching this.
For the “reasoning about the spec” thing, I’m confused why the filtering step isn’t necessary.
Roger: I’m much more bearish on the encode/decode being possible. Sometimes I think in terms of images or video! Other times I have intuitions that I can’t put into words when pressed. Also, there might just not be human interpretable words for explaining what’s going on in the AI’s computation.
@SebastianP I feel like that doesn’t make any sense—shouldn’t this type of DPO clearly make the AI talk like a pirate? Did it just not transfer very broadly or something?