Yep we discuss that general idea in this section: “Reason 1: Capabilities and misalignment might be too tightly linked”, tho we didn’t mention the reasoning you gave for why powerful LLMs might be more likely to be misaligned than weak ones.
SebastianP
With that in mind, is there a clear criterion for dividing AI into ‘Teacher’ and ‘Student’ roles?
In general, the teacher will be the strongest model available to a lab that they don’t trust (else they’d just use that model). The student will be some weaker model that we’re much more confident we can trust, due to some combination of thinking it’s too dumb to be competently scheming against us, and having done heavy audits on it. In general these are referred to as U (for untrusted) and T (for trusted). We expect U to be broadly more capable than T, in the same way that Opus 4.8 is broadly more capable than Opus 3.
Furthermore, in Idea 4, you mentioned capping the Student AI’s intelligence at a level where alignment audits are still viable(i.e. auditable). It seems to me that we would still need some kind of metric to measure the LLM’s intelligence for this purpose. We can’t simply rely on benchmark scores, as we’ve seen far too many instances where they bear little relation to an AI’s actual ability to solve real-world problems.
Yeah this is a hard problem to solve. Basically we need to be able to quantify “how likely is it that this model could be subverting all our alignment evals”. Some ways we could do this are measuring capabilities required for doing that, like secret keeping, latent time horizon (i.e. ability to reason opaquely), situational/eval awareness, etc. If a model is bad at enough of these capabilities we might be reasonably justified in saying that it’s auditable.
I’ve found that doing no-reasoning SFT (train on output conditioned on empty reasoning) on reasoning models (like they do in Emil’s paper that you linked) can cook them quite badly. I wasn’t able to prevent the model from completely losing it’s reasoning ability when doing full-weight finetuning on qwen3-32b, and it sometimes reasons weirdly even when trained with LoRA, tho there seems to be some intra-run variance here (it sometimes comes out completely normal).
The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
How to reduce capability degradation from off-model SFT
Nice, thanks for running this! Hmm I’m not sure I buy the model size claim, seems like you’d probably get severely diminishing returns in terms of “how well the model represents the DPO diff”, since smaller models these days are already seem to have good world models. b makes a lot of sense tho!
Hmm i see, I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt), for example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
But thanks for suggesting, I think it’s worth a try, if I get some time this week I’ll run the experiment!
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can get the model to be the lowest possible suspicion it could be (and we proxy this with the benign model.
In general, I think I’m a bit confused why you think the problem with benign stats implicates commented code anymore than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep seems very reasonable! I’m also a bit worried about how reasonable this is fwiw.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.
Oh interesting! Why do you think this would make it more robust?
Yeah, I’m pretty skeptical about 1. It seems hard to convince lab leadership using such a technique, but honestly the technique isn’t that complicated so maybe it’s doable. I think the basic pitch of “we trained a different model to imitate U and it became evil, which means U was also probably evil, but just hiding it” is actually moderately convincing to laypeople, though the subliminal learning stuff makes it complicated for sure. Either way, I think skepticism is super reasonable here!
As for 2, we originally thought there might be some nice ways to do this. For example, maybe the honeypots that were successful on the distilled model might give us insight into the honeypots that will work on U (in some sense narrowing the hypothesis space for red-teamers). I think this seems pretty hard given U’s strength, which is why we didn’t include this idea in the post. In a very optimistic scenario, we might be able to just have the distilled model tell us about U’s misalignment—for example the distilled model might be able to tell us what type of misaligned U is (schemer, reward seeker, etc), tho I find this very unlikely.
Overall I’m more excited by the benefits of AI safety people taking such evidence seriously. I think this could cause a lot of important actions, such as not trusting safety research by U, improving control protocols and allocating more resources to finding convincing behavioral evidence.
Hey, thanks for the comment, and sending the doc from Vivek (was very informative)!
I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor roc curves.
Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on thoseIf we did that, combined with sampling trajectories identical to the deployment distribution for the control eval, I think we’d end up in the situation A you label above, except where the summary statistics are sufficiently close to benign (because we trained them to be).
I definitely think it’s not totally clear whether the benign stats assumption is reasonable, but I kinda buy the argument above. Let me know if I’m misunderstanding you, also happy to chat! Also thanks for bringing this up, I hadn’t thought super deeply about this before :)
Glad to hear you liked it :)
If you mean on the red team side, then actually the way we make the backdoored models is through prompt distillation! We generate normal responses with a normal system prompt, and then the backdoored responses with a different system prompt that instructs it to do the behavior (e.g. mention cats). In the case of some backdoors, we’ll synthetically generate the outputs. Then we just prepend the trigger to the user prompts of the backdoored responses and train.
If you meant on the blue team side (e.g. on-policy distillation), we have not! My guess is that things that are on-policy, or have sparse supervision (e.g. RL) will do much worse at training out backdoors.
Advice for making robust-to-training model organisms
This was Qwen3-30b-a3b-instruct-2507. I did lora finetuning with tinker, and generated the pirate/normal data with llama 8b. Oh that’s very interesting huh! I haven’t played around much with DPO but it seems to be a very finnicky technique
I was super surprised by this too, so I tried a few things to try and make it work (training for longer + higher lr, using DPOP, changing KL hyperparameter). The thing that worked best was turning down the beta hyperparameter, which controls the scaling for the kl divergence loss term. This allowed the model to speak like a pirate (tho not super consistently), but it also cooked the model on-distribution (alpaca) and slightly off-distribution (math). For example, the model sometimes just outputs this:
(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(**
Claude explains why (and correctly guessed the result of the experiment): https://claude.ai/share/7c1616db-23c4-4835-b3ba-966456fbfa83
Claude’s solution is to just add an SFT loss term (which runs into the problem from this post)
Yep we have! And you’re right, it doesn’t cause degradation (likely because of the “cancelling” effect you mentioned). However, we found the types of properties we could instill with DPO more limited than SFT. For example, we tried using DPO to train a model to talk like a pirate (all the positive examples were in pirate speak, and the negative examples were normal), and it didn’t work.
Why does off-model SFT degrade capabilities?
We are working on this, though we’re still iterating to figure out the best way to do it (in particular, getting subliminal learning to work across models is difficult).
We tried negative inoculation (framing the interaction as misaligned) to achieve more transfer, but couldn’t get this to work, similar to here.
We’re trying :)
Super cool research btw!
The main benefit of training on outputs-only is that we can apply countermeasures: filtering, paraphrasing, etc. I don’t know how you could do that when training on internals (maybe an interpretability bottleneck like they do here?). however, training on the entire residual stream does seem promising for doing incrimination via distillation!
Surely these show up more in the internals if the model is deliberately trying to not explicitly show it’s bad behavior?