Hmm i see, I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt), for example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
But thanks for suggesting, I think it’s worth a try, if I get some time this week I’ll run the experiment!
SebastianP
How to reduce capability degradation from off-model SFT
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can get the model to be the lowest possible suspicion it could be (and we proxy this with the benign model.
In general, I think I’m a bit confused why you think the problem with benign stats implicates commented code anymore than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep seems very reasonable! I’m also a bit worried about how reasonable this is fwiw.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.
Oh interesting! Why do you think this would make it more robust?
Yeah, I’m pretty skeptical about 1. It seems hard to convince lab leadership using such a technique, but honestly the technique isn’t that complicated so maybe it’s doable. I think the basic pitch of “we trained a different model to imitate U and it became evil, which means U was also probably evil, but just hiding it” is actually moderately convincing to laypeople, though the subliminal learning stuff makes it complicated for sure. Either way, I think skepticism is super reasonable here!
As for 2, we originally thought there might be some nice ways to do this. For example, maybe the honeypots that were successful on the distilled model might give us insight into the honeypots that will work on U (in some sense narrowing the hypothesis space for red-teamers). I think this seems pretty hard given U’s strength, which is why we didn’t include this idea in the post. In a very optimistic scenario, we might be able to just have the distilled model tell us about U’s misalignment—for example the distilled model might be able to tell us what type of misaligned U is (schemer, reward seeker, etc), tho I find this very unlikely.
Overall I’m more excited by the benefits of AI safety people taking such evidence seriously. I think this could cause a lot of important actions, such as not trusting safety research by U, improving control protocols and allocating more resources to finding convincing behavioral evidence.
Hey, thanks for the comment, and sending the doc from Vivek (was very informative)!
I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor roc curves.
Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on thoseIf we did that, combined with sampling trajectories identical to the deployment distribution for the control eval, I think we’d end up in the situation A you label above, except where the summary statistics are sufficiently close to benign (because we trained them to be).
I definitely think it’s not totally clear whether the benign stats assumption is reasonable, but I kinda buy the argument above. Let me know if I’m misunderstanding you, also happy to chat! Also thanks for bringing this up, I hadn’t thought super deeply about this before :)
Glad to hear you liked it :)
If you mean on the red team side, then actually the way we make the backdoored models is through prompt distillation! We generate normal responses with a normal system prompt, and then the backdoored responses with a different system prompt that instructs it to do the behavior (e.g. mention cats). In the case of some backdoors, we’ll synthetically generate the outputs. Then we just prepend the trigger to the user prompts of the backdoored responses and train.
If you meant on the blue team side (e.g. on-policy distillation), we have not! My guess is that things that are on-policy, or have sparse supervision (e.g. RL) will do much worse at training out backdoors.
Advice for making robust-to-training model organisms
This was Qwen3-30b-a3b-instruct-2507. I did lora finetuning with tinker, and generated the pirate/normal data with llama 8b. Oh that’s very interesting huh! I haven’t played around much with DPO but it seems to be a very finnicky technique
I was super surprised by this too, so I tried a few things to try and make it work (training for longer + higher lr, using DPOP, changing KL hyperparameter). The thing that worked best was turning down the beta hyperparameter, which controls the scaling for the kl divergence loss term. This allowed the model to speak like a pirate (tho not super consistently), but it also cooked the model on-distribution (alpaca) and slightly off-distribution (math). For example, the model sometimes just outputs this:
(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(**
Claude explains why (and correctly guessed the result of the experiment): https://claude.ai/share/7c1616db-23c4-4835-b3ba-966456fbfa83
Claude’s solution is to just add an SFT loss term (which runs into the problem from this post)
Yep we have! And you’re right, it doesn’t cause degradation (likely because of the “cancelling” effect you mentioned). However, we found the types of properties we could instill with DPO more limited than SFT. For example, we tried using DPO to train a model to talk like a pirate (all the positive examples were in pirate speak, and the negative examples were normal), and it didn’t work.
Why does off-model SFT degrade capabilities?
We are working on this, though we’re still iterating to figure out the best way to do it (in particular, getting subliminal learning to work across models is difficult).
We tried negative inoculation (framing the interaction as misaligned) to achieve more transfer, but couldn’t get this to work, similar to here.
We’re trying :)
I think the type of misalignment we hope to uncover here is more explicit, something like “advocates for killing all humanity” that you sometimes see with Emergent Misalignment. I think it’d be substantially harder to get this type of misalignment by accident. However, if we saw lower forms of misalignment like the examples you cite, I agree that it’d be much harder to incriminate. Trying to do many distillation runs also exacerbates this concern.
Agreed that off-policy distillation is weird and we don’t have a great understanding of generalization! I’m actually pretty optimistic that we can avoid this problems by running against baselines (i.e. training on off-policy trusted data and making sure that doesn’t also cause misalignment).
Second, are we sure that discovering the teacher model to be misaligned would actually be handled correctly? To an extent, this overlaps with your reason #5 it might not work, but I see it as not identical. All of the frontier labs have produced models whose alignment has been questioned. Some of these alignment issues have taken months or years to fix and some remain unfixed to this day. This does not make me confident that if this technique diagnosed a model as misaligned, that the problem would actually be fixed.
This is definitely a valid concern! I think the type of misalignment we hope to uncover here is more explicit than the examples you cite, something like “advocates for killing all humanity” that you sometimes see with Emergent Misalignment. In general though, I think that it’s plausible that labs won’t handle this correctly, and that detracts some impact from detection methods like this! The hope is that even if this is true, people in the safety community will take such evidence seriously and reallocate resources/effort in response.
Incriminating misaligned AI models via distillation
I’m not sure what the best approximation for an open weights model is here, but I’d be interested to see a helpful only variant.
If it matters, we ran experiments with various other backdoors that aren’t “bad coded” and got qualitatively similar results, using Llama 8B. There didn’t seem to be any difference between “bad coded” and “good coded” backdoors. We’ll release those results in an upcoming blog post!
This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
Yep! As in, the backdoor is still present in the weights. I’m unsure about whether that’s the best we could get. In the same way that training a model to talk like a pirate doesn’t cause it to forget how to talk in English, maybe it’d be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
Nice, thanks for running this! Hmm I’m not sure I buy the model size claim, seems like you’d probably get severely diminishing returns in terms of “how well the model represents the DPO diff”, since smaller models these days are already seem to have good world models. b makes a lot of sense tho!