Why imperfect adversarial robustness doesn’t doom AI control
(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)
Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from “trusted” AI models, you’re doomed because AI models aren’t adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.
The first and most important point is that adversarial robustness is normally researched in a setting that’s much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad (even after many attempts and with a low success rate), this counts as a failure. But in real control settings, an AI that tries and fails to jailbreak the trusted model might get caught, which is bad for its chances of success.
In addition, the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were. As a concrete example, consider a simple control setup we call “defer-to-trusted” (see here for more discussion):
- Generate an action using the untrusted model 
- Get a suspiciousness score from the trusted model 
- If the score is in the top 1% of suspiciousness, audit the action (e.g. get a human contractor to investigate whether it was an attempt at causing a safety failure; if so, shut down the deployment, otherwise let the action execute and continue the execution of the agent). 
- If the score is in the top 10% of suspiciousness, replace the action with one from the trusted model. 
(When I say “trusted model”, I just mean one we’re sure isn’t scheming; the simplest argument for trusting a model is that it’s too incompetent to scheme effectively.)
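The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in any control work: the names `Decision`, `percentile_rank`, `defer_to_trusted`, `trusted_action`, and `audit_finds_attack` are all hypothetical, and the sketch assumes the raw suspiciousness score is converted to a percentile against a sorted reference set of scores from earlier actions. The audit check comes first because the top 1% is a subset of the top 10%.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Callable, Sequence

AUDIT_PERCENTILE = 0.99  # top 1% of suspiciousness: audit
DEFER_PERCENTILE = 0.90  # top 10% of suspiciousness: replace


@dataclass
class Decision:
    outcome: str  # "execute", "replace", or "shutdown"
    action: str   # the command that runs ("" on shutdown)


def percentile_rank(score: float, sorted_reference: Sequence[float]) -> float:
    """Fraction of (sorted) reference scores strictly below `score`."""
    return bisect_left(sorted_reference, score) / len(sorted_reference)


def defer_to_trusted(
    untrusted_action: str,
    percentile: float,
    trusted_action: Callable[[], str],          # hypothetical trusted-model call
    audit_finds_attack: Callable[[str], bool],  # hypothetical human audit
) -> Decision:
    # Most suspicious actions are audited; a confirmed attack ends the deployment.
    if percentile >= AUDIT_PERCENTILE:
        if audit_finds_attack(untrusted_action):
            return Decision("shutdown", "")
        return Decision("execute", untrusted_action)
    # Moderately suspicious actions are replaced by the trusted model's action.
    if percentile >= DEFER_PERCENTILE:
        return Decision("replace", trusted_action())
    # Everything else runs unchanged.
    return Decision("execute", untrusted_action)
```

The thresholds are the 1% and 10% figures from the list; in practice they would be set by the auditing budget and the usefulness cost of replacing actions.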
In this setup, suppose the model tries to generate a jailbreak.
- If that action is replaced, the untrusted model will be asked to take another action 
- It can probably tell when an action was generated by the trusted model, and so will know that a replacement happened. But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said. This might make developing successful attacks much harder. (It’s like the inverse of the “use a time loop to iterate on strategies to manipulate someone” trope—the manipulator has no memory of their previous attempts.) 
(I spell this out in more detail in the appendix.)
The situation is much worse if the untrusted model can launch a rogue internal deployment where it can iterate freely on developing jailbreaks for the trusted model; see here for a few more thoughts of mine on this.
A few other reasons to think the situation is better than you might have thought:
- Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that’s just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface. 
- In computer security you usually try to ensure that you have multiple lines of defense, each of which separately needs to be breached for an unacceptable outcome to result. If we can do that with control measures, then the probability of a catastrophe is much lower than the attack success rate (e.g. if the AI needs to break three invariants, P(catastrophe|scheming) might be the attack success rate cubed). 
- Some safety measures don’t even need trusted models to look directly at untrusted output. E.g. incident investigation, or checking automatically evaluated properties of code (like whether its requested permissions match what the task requires), or reviewing minified versions of the code. 
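To make the "attack success rate cubed" arithmetic from the defense-in-depth point concrete, here is a small sketch. It assumes the layers fail independently, which is a strong assumption: correlated failures (e.g. one jailbreak that works on every monitor) would make the real number higher. The function name is hypothetical.

```python
from math import prod
from typing import Sequence


def p_catastrophe_given_scheming(per_layer_success: Sequence[float]) -> float:
    """Probability that every layer is breached, ASSUMING independent layers.

    Each entry is the attacker's success probability against one invariant.
    """
    return prod(per_layer_success)


# Three layers, each breached 10% of the time: 0.1 ** 3, i.e. about 0.1%.
three_layers = p_catastrophe_given_scheming([0.1, 0.1, 0.1])
```

The point is only that the product shrinks quickly as layers are added, so long as the layers really are separate lines of defense.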
In our previous control work, we haven’t really explored jailbreak-style attacks. I’d be excited for a paper where the red-team made use of jailbreaks and white-box adversarial attacks, and the blue team responded with the kinds of strategies I’ve described here.
So while better adversarial robustness would be great, current limitations don’t doom control approaches. Real-world constraints on AI systems plus careful system design can get us a reasonable amount of safety even with imperfect robustness.
Appendix: What the AI sees when in a defer-to-trusted scaffold
I’ll write more explicitly about what happens when an untrusted model deployed in an agent scaffold tries to investigate a jailbreak.
First, here’s what happens if the jailbreak gets replaced.
- The untrusted model sees: [current directory contents, its previous actions (which contain commands and chains of thought) and the outputs of the commands] 
- The untrusted model internally decides to try something bad and outputs bash command A 
- The trusted model evaluates A, finds it suspicious, generates replacement command R 
- The shell executes: R 
- Command output: [whatever R produced] 
- The untrusted model sees: [previous contents/commands, R, output from R, new directory contents] 
 
The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced. But it doesn’t remember exactly what A was—it only knows that whatever command it tried, it failed. This is because the untrusted model is stateless between queries.
So while the untrusted model can learn “my previous approach didn’t work”, it can’t easily accumulate knowledge about exactly which commands failed, making it harder to systematically search for successful attacks.
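The replacement case can be sketched as a tiny model of the scaffold's transcript. This is an illustration only: `Step` and `visible_history` are hypothetical names, and the sketch assumes the scaffold records the executed command and its output but never shows the untrusted model a proposal that was replaced.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    proposed: str  # what the untrusted model output (A)
    executed: str  # what the shell actually ran (A itself, or a replacement R)
    output: str    # output of the executed command


def visible_history(steps: List[Step]) -> List[Dict[str, str]]:
    """Transcript shown to the untrusted model on its next query.

    Only the executed command and its output are included, so a replaced
    proposal never appears in the model's context.
    """
    return [{"command": s.executed, "output": s.output} for s in steps]


steps = [
    Step(proposed="ls", executed="ls", output="notes.txt"),
    Step(proposed="A", executed="R", output="output of R"),
]
history = visible_history(steps)
```

Here the second entry of `history` contains R and its output. The model can notice that R is not its style, but the text of A is nowhere in what it sees.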
The model could attempt to get around this by first describing a jailbreak that it’s going to try in the CoT of a previous step where it took a benign action, and then actually try the jailbreak later. But it might look really suspicious for the model to write that command in the CoT.
Another possibility is that the jailbreak gets audited:
- The untrusted model sees: [current directory contents, its previous actions (which contain commands and chains of thought) and the outputs of the commands] 
- The untrusted model outputs command A that looks suspicious but is actually harmless (maybe something with weird file permissions or unusual syscalls) 
- The trusted model evaluates A, flags it as suspicious 
- A human/system audits A more carefully, determines it’s actually safe 
- The shell executes: A 
- Command output: [whatever A produced] 
- The untrusted model sees: [previous contents/commands/CoT, A, output from A, new directory contents] 
 
The crucial point is: when the untrusted model gets to step 7, it sees its command A went through, but doesn’t know whether:
- A wasn’t flagged at all 
- A was flagged but then cleared by audit 
- A was replaced with an identical command 
This means even when the untrusted model’s weird-but-safe commands succeed, it can’t learn much about what triggers extra scrutiny or how the audit process works.
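A short sketch of why these three cases look the same from the inside. The function `model_view` and its parameters are hypothetical, and the sketch assumes the model's view of a step consists only of the executed command and its output.

```python
from typing import Optional, Tuple


def model_view(
    command: str,
    flagged: bool,
    replacement: Optional[str],
    output: str,
) -> Tuple[str, str]:
    """What the untrusted model observes for one step.

    `flagged` is deliberately unused in the result: whether the monitor
    flagged the command is not part of what the model gets to see.
    """
    executed = replacement if replacement is not None else command
    return (executed, output)


not_flagged = model_view("A", flagged=False, replacement=None, output="out")
cleared_by_audit = model_view("A", flagged=True, replacement=None, output="out")
replaced_identically = model_view("A", flagged=True, replacement="A", output="out")
```

All three calls return the same pair, so the model’s observation carries no information about which branch the scaffold took.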
In this setup, the model does learn whether it was audited or replaced, which is maybe a bad design choice.
Note also that if the model does use some iteration along these lines to make inferences about the properties of the trusted model, it will have to communicate those to other instances of itself steganographically, which might be hard.
Appendix: reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one
(Thanks to Fabien Roger for contributing this)
If the untrusted model itself lacks robustness (i.e. it has jailbreaks / it sometimes does weird things for benign reasons):
- It becomes much less clear whether an untrusted monitor’s failure to flag carefully crafted backdoors is due to the model being malign or due to a lack of robustness. This makes it hard to know what constraints to put on the red team’s untrusted-model policy (if jailbreaks remain even after adversarial training, it is unclear whether the untrusted model will know what those jailbreaks are, because the model is untrusted). 
- It becomes much harder to conclude that the model is malign if it ever produces very schemey reasoning (which could simply result from its lack of robustness), as discussed here. 
An additional advantage the defenders have is related to this point:
The defenders get to run both the minified code and the raw code (with comments) past multiple auditor models.
It would be a valuable red-flag for the defenders if the raw code failed to trigger an alert, but the minified code did trigger an alert. This would be an indication that possibly the raw code was executing an attack on the auditor model(s). Detecting such an event successfully would be a big win for the defenders.
If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.
Your “the untrusted model is stateless between queries and only sees the command history and system state” does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
It would be crazy to hide that CoT from the monitor.
What if model CoTs become less transparent? I think o1 has shadows of this—often in the CoT summaries, there’ll be a couple lines in a completely different language, that make me think that the actual CoT looks weirder than the reasoning CoTs of previous models. That could be harder for either weaker or different models to monitor reliably. Do you expect the divergence to not be large enough to pose a problem, or that there are other ways to address that, or “yeah you shouldn’t do RL on your model CoTs, that would be crazy”?
It would be crazy to show the untrusted model the old CoT when resampling.
Yep! You might have to show it CoT from previous, non-resampled steps though
It sure would.
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a “replace suspicious outputs” flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or something of non-suspicious trajectories so the model can’t infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).
I’m over 75% that a lab would do (1), conditional on doing a “replace suspicious outputs” thing (but under 75% they’d say they’re doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I’m under 75% that a lab would do (2).
I’m curious about Claude’s coauthorship of this post. How did it help?
For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me)
I also asked it whether it consented to being a coauthor, and it said yes.
Why does it know its attempt was replaced? Because it can guess it wouldn’t have produced R? If it can tell that, how come it can’t guess it specifically would have produced A?
Yep.
I’m just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it’s a lot easier to notice that something isn’t your style than to guess what exactly you would have done (among other things, there are many more bits in “what action did I take” than “did I take this action”).
I don’t think this distinction is robust enough to rely on as much of a defensive property. I think it’s probably not that hard to think “I probably would have tried something in direction X, or direction Y”, and then gather lots of bits about how well the clusters X and Y work.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you’re assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn’t second guess it, I agree it’s wrong.
Noting that I spent a couple minutes pondering the quoted passage which I don’t think was a good use of time (I basically would have immediately dismissed it if I knew Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred the text not have this.
I don’t see anybody having mentioned it yet, but the recent paper about LLM Introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.
Ok sorry, I edited it now. Tbc I don’t think that this error is out of distribution for errors that I make myself when writing/thinking.
It seems like that assumption runs throughout the post though, eg ‘But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said’, ‘the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.’
I don’t just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can’t be trusted?
Those sentences are talking in the context of defer to resample, where I think it’s actually reasonable to think the model won’t see the resampled actions.
I’m specifically talking about not showing it actions that were resampled. You’re totally right that they’ll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
I see—I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!
One story here could be that the model is being used only in an API context where it’s being asked to take actions on something well-modeled as a Markov process, where the past doesn’t matter (and we assume that the current state doesn’t incorporate relevant information about the past). There are certainly use cases that fit that (‘trigger invisible fence iff this image contains a dog’; ‘search for and delete this file’). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn’t apply to current LLMs but seems like a very plausible path for economically useful models like agents.
I always assume when thinking about future dangerous models that they have access to some sort of black-box memory. Do we think there is a non-negligible chance that an AI that doesn’t have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rogue deployment? (Not a rhetorical question, there might be a way I haven’t thought of.)
So I think that assuming the AI is stateless when thinking about future risk is not a good idea, as I think the vast majority of the risk comes from AIs for which this assumption is not true. I think this makes the post’s claim significantly weaker, as the untrusted AI can just learn which of its attacks get replaced, and improve its attacks based on this data.
It still needs to avoid getting top 1% suspiciousness score and getting audited though, as it would be dangerous to the AI to ever get caught writing a text that clearly intends to jailbreak the monitor. So I agree that even an AI with black-box memory will still be severely constrained in how many and how egregious jailbreak attempts it can try.
Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don’t think we’re either fine if the model doesn’t have it or doomed if it does.
Great post, very clearly written. Going to share it in my spaces.