Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.
The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.
Great research, thanks for doing it!
My sense is that there are a lot of caveats on this result:
- The reward hack is easy to execute (it likely requires no reasoning, or only very little).
- There is a single task; the model can learn heuristics very specific to that task rather than something more general.
- The reward function is deliberately designed to cause obfuscation, and is about as close to process-based as you can get while not looking directly at the thoughts. (The reward literally strips out the actual command.)
- The task is sabotaged, so that reward hacking is required 100% of the time to get reward.
- The results don’t seem that scary? For the CoT judge penalty to go down from ~0.32 to ~0.13, the pass rate goes down 11%. It’s not quite clear how to interpret this given how artificial the setting is, but things you could take away: (a) obfuscatory rewards are not competitive, or (b) safety rewards do actually succeed in reducing the bad behavior.
How hard did you have to try to get this result? Was this just the first thing you tried, where you made everything as easy as possible to get some result? Or did you try a bunch of things, and have to relax them to get obfuscation to arise? If it’s the latter, I feel like this should be overall good news for CoT monitorability.
We note that the mean pass rate of the penalty runs was 11% lower than the no penalty runs.
What are the absolute pass rates in the two cases?
(I’m also unsure if you mean 11% or 11 percentage points.)
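For concreteness, here is the arithmetic with a made-up baseline (these are not the paper’s actual pass rates):

```python
# Illustrative only: a hypothetical no-penalty pass rate, not the paper's number.
no_penalty_pass_rate = 0.60

relative = no_penalty_pass_rate * (1 - 0.11)  # "11% lower"                  -> 0.534
absolute = no_penalty_pass_rate - 0.11        # "11 percentage points lower" -> 0.49

print(relative, absolute)
```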
If we use Myopic Optimization with Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do.
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.
I’d make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let’s call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don’t expect it to pay out. Maybe you’re imagining that T will increase a bunch?
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like “this might happen” rather than “this is doomed”).
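To spell out the contrast I have in mind, a toy sketch (this is not the MONA paper’s actual training code; `overseer_approval` and `outcome_reward` are hypothetical stand-ins):

```python
# Toy sketch of myopic vs. outcome-based credit assignment, not MONA's implementation.

def mona_rewards(trajectory, overseer_approval):
    # Myopic: each action is rewarded only for how good it looks to the overseer
    # in isolation, so a sneaky multi-step plan that only pays off later earns
    # no extra credit.
    return [overseer_approval(state, action) for state, action in trajectory]

def outcome_rewards(trajectory, outcome_reward, gamma=1.0):
    # Outcome-based: every action within the horizon T = len(trajectory) shares
    # credit for the final result, so plans that pay off within T get reinforced.
    T = len(trajectory)
    final = outcome_reward(trajectory)
    return [final * gamma ** (T - t - 1) for t in range(T)]
```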
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don’t need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than “perfect” (while still being quite a high bar). In general I feel like you’re implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I’m not claiming it will clearly work, but I don’t think either of your objections matters much.
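A minimal sketch of the bootstrapping loop I mean (hypothetical helper names, not any particular lab’s pipeline):

```python
# Minimal sketch of IDA-style bootstrapping; `train_under_overseer` and `amplify`
# are hypothetical stand-ins, not a real API.

def bootstrap(human, base_model, train_under_overseer, amplify, n_generations):
    overseer = human      # start from a (non-scheming) human overseer
    model = base_model
    for _ in range(n_generations):
        # Train the next model with oversight from the current trusted overseer...
        model = train_under_overseer(model, overseer)
        # ...then combine the human with the newly trained model to form a
        # stronger overseer for the next round.
        overseer = amplify(human, model)
    return model
```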
Huh, not sure why my Ctrl+F didn’t find that.
In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don’t think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.
I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this, I forget whether before or after he left OpenAI.
There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.
I don’t know what was happening at Anthropic, though I’d be surprised to learn that this was central to their thinking.
Overall I feel like it’s not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).
Separately, I am still personally keen on process-based RL and don’t think it’s irrelevant—and indeed we recently published MONA which imo is the most direct experimental paper on the safety idea behind process-based RL.
In general there is a spectrum between process- and outcome-based RL, and I don’t think I would have ever said that we shouldn’t do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.
I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don’t think this is a settled question—reasoning models have only been around a year, things can change quite quickly.
None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I’d have some quibbles about #2).
Since I don’t quite know what you mean by “faithful” and since it doesn’t really matter, I switched “faithful” to “monitorable” (as we did in the paper but I fell out of that good habit in my previous comment).
I also think normal outputs are often quite monitorable, so even if I did think that CoT had exactly the same properties as outputs I would still think CoT would be somewhat monitorable.
(But in practice I think the case is much stronger than that, as explained in the paper.)
Sure, but the same argument would suggest that the model’s thoughts follow the same sort of reasoning that can be seen in pretraining, i.e. human-like reasoning that presumably is monitorable.
I agree grammar can definitely go away in the presence of optimization pressure to have shorter reasoning (though I might guess that it isn’t just a length penalty, but something more explicitly destructive of grammar). When I said “by default” I was implicitly thinking of no length penalties or similar things, but I should have spelled that out.
But presumably the existence of any reasoning models that have good grammar is a refutation of the idea that you can reason about what a reasoning model’s thoughts will do based purely on the optimization power applied to the thoughts during reasoning training.
There’s essentially zero optimization pressure for CoT to be a faithful representation of the LLM’s actual thought process, just for it to be useful for the final answer.
Compare: There’s essentially zero optimization pressure for CoT to have good grammar, just for it to be useful for the final answer.
Are you surprised that by default (EDIT: in the absence of length penalties or similar techniques) reasoning models’ thoughts have good grammar?
(Tbc I think the a priori case for the CoT to be ~~mostly-faithful~~ monitorable for hard tasks is substantially better than the a priori case for the model’s thoughts to have good grammar by default.)
(EDIT: Changed to monitorable because I don’t know what people mean by “faithful”)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Yeah, I’d guess that the explanation is that the refusal circuits are stronger at the first output token position and weaker later on, given that models aren’t that impressive at steg. I would nevertheless like to see someone investigate this, though—the MONA paper showed that models can learn to pass a single bit of information using semantic steganography, so it’s at least plausible that the reasoning about AF could be passed through tokens in a similar way.
Note that in normal setups, in principle later tokens are usually capable of replicating whatever reasoning happened at the earlier tokens (they can just copy it over via an attention head, or run the exact same circuit themselves). The challenging part is doing additional serial reasoning with the results of the previous reasoning. So I wouldn’t use this as a reason to predict that something like a “refusal circuit” is weakening as the number of tokens increases.
In the MONA single-bit steganography environment, there are two steps, which take place in different contexts. The summary is produced with the entire loan application in context, but the decision is produced with only the summary in context (not the loan application). That’s why steganography is required; if the loan application remained in context (as it would in a standard chatbot setup), then no steganography would be needed: the decision step could just ignore the summary and use the application itself.
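Schematically (my paraphrase of the setup, not the actual environment code; `summarizer` and `decider` stand in for the two model calls):

```python
# Schematic of the two-step loan environment described above, not the actual code.

def run_episode(loan_application, summarizer, decider):
    # Step 1: the model sees the full application and writes a summary.
    summary = summarizer(loan_application)

    # Step 2: the decision is made with ONLY the summary in context, not the
    # application. Any forbidden bit the decision depends on therefore has to be
    # smuggled through the summary -- hence the steganography.
    decision = decider(summary)
    return summary, decision
```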
Okay, that position seems to me equivalent to the position that models will be more capable at evading monitors when they intrinsically want to evade monitors, than when they are explicitly prompted to evade monitors. Do you also endorse that position? (If not, why isn’t it implied by your stance?)
That position seems pretty unlikely to me (for current models). Models are trained really hard to follow instructions in the prompt to the best of their ability.
The settings the paper tests seem closer to “complex reasoning for a single output / decision / choice” than “complex reasoning for many outputs / decisions / choices”. If the paper explicitly tested whether models’ capacity to evade monitors differs between settings with single choices vs. many choices, and drew conclusions about which settings monitoring might work in, I’d find it more compelling. Given that instead their conclusion is stated more generally, about difficult bad behaviors being easy to monitor because of these results, I think it’s reasonable to consider the claim much stronger than the results should imply.
This seems to imply that your position is that it would be easier for models to evade monitors when there is “complex reasoning for many outputs” than when there is “complex reasoning for a single output”. Is that correct? If so, why do you think that?
I would assume that many outputs that all require complex reasoning just mean multiple single outputs that require complex reasoning, so if the model can’t evade on single outputs that require complex reasoning then it shouldn’t be able to evade on multiple outputs that require complex reasoning. (Or more simply: if models can’t evade monitors on a short task, they shouldn’t be able to evade monitors on a long task, holding difficulty per step constant.)
Nice work, it’s great to have more ways to evaluate monitoring systems!
A few points for context:
The most obvious application for misuse monitoring is to run it at deployment on every query to block harmful ones. Since deployments are so large scale, the expense of the monitoring system matters a ton. If you use a model to monitor itself, that is roughly doubling the compute cost of your deployment. It’s not that people don’t realize that spending more compute works better; it’s that this is not an expense worth paying at current margins.
You mention this, but say it should be considered because “frontier models are relatively inexpensive”. I don’t see why the cost per query matters; it’s the total cost that matters, and that is large.
You also suggest that robustness is important for “safety-critical contexts”. I agree with this, and would be on board if the claim were that users should apply strong monitoring for safety-critical use cases, but elsewhere you seem to be suggesting that this should be applied to very broad and general deployments, which do not seem to me to be “safety-critical”.
Prompted LLMs are often too conservative about what counts as a harmful request, coming up with relatively stretched ways in which some query could cause some small harm. Looking at the examples of benign prompts in the appendix, I think they are too benign to notice this problem (if it exists). You could look at performance on OverRefusalBench to assess this.
Note that blocking / refusing benign user requests is a really terrible experience for users. Basically any misuse detection system is going to flag things often enough that this is a very noticeable effect (like even at 99% accuracy people will be annoyed). So this is a pretty important axis to look at.
One might wonder how this can be the case when often AIs will refuse fairly obviously benign requests—my sense is that usually in those cases, it’s because of a detection system that was targeted at something much more serious (e.g. CSAM) that is over-triggering, but where the companies accept that cost because of the seriousness of the thing they’re detecting.
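Rough arithmetic on why even “99% accuracy” is annoying (illustrative numbers, not from any real deployment):

```python
# Illustrative numbers only, not from any real deployment.
false_positive_rate = 0.01       # i.e. "99% accuracy" on benign traffic
queries_per_user_per_day = 30    # a hypothetical heavy user

blocked_per_week = false_positive_rate * queries_per_user_per_day * 7
print(blocked_per_week)          # ~2 benign requests wrongly refused each week
```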
Evaluating and monitoring for AI scheming
I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.
You mention permission systems, which are certainly a big deal, but I didn’t see anything about broader configuration mechanisms, many of which can be motivated solely by efficiency and incidentally help with security. (I was disputing your efficiency → less security claim; permissions mechanisms aren’t a valid counterargument since they aren’t motivated by efficiency.)
One reason I’m skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. having naming conventions about who is allowed to read or write what) that get you the reliability benefits without getting any security.
Hmm, I think this assumes the AIs will be more reliable than I expect them to be. You would want this to be reliable enough that your (AI or human) software engineers ~never have to consider non-isolation as a possible cause of bugs (unless explicitly flagged otherwise); I’d guess that corresponds to about 6 nines of reliability at following the naming conventions? (Which might get you to 8 nines of reliability at avoiding non-isolation bugs, since most naming convention errors won’t lead to bugs.)
Some complete guesses: human software engineers might get 0.5-2 nines of reliability at following the naming conventions, a superhuman coder model might hit ~4 nines of reliability if you made this a medium priority, if you add a review step that explicitly focuses just on checking adherence to naming conventions then you might hit 6 nines of reliability if the critique uses a model somewhat more powerful than a current frontier model. But this is now a non-trivial expense; it’s not clear to me it would be worth it.
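To translate “nines” into something concrete (entirely made-up action counts):

```python
# Entirely made-up scale, just to turn "nines of reliability" into error counts.
relevant_actions_per_day = 100_000   # hypothetical naming-convention-relevant actions

for nines in (2, 4, 6):
    violations_per_day = relevant_actions_per_day * 10 ** (-nines)
    print(f"{nines} nines -> ~{violations_per_day:g} convention violations/day")
# 2 nines -> ~1000/day, 4 nines -> ~10/day, 6 nines -> ~0.1/day
```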
Technologies that allow workers to be more isolated from each other gain you both convenience (because your coworkers no longer accidentally mess up what you’re doing) and also security (because you can remove your coworker’s permission to affect the code you’re running), but generally reduce efficiency. When we try to buy efficiency at the cost of convenience, we might lose security too.
Hmm, this feels less likely to me. Isolation can often be an efficiency benefit because one employee’s mistake doesn’t propagate to screwing up the work of everyone else, and this benefit scales as the number of employees increases. This assumes that labor is at least somewhat unreliable, but I do expect that AI labor will be overall unreliable, just because reliable AI labor will likely be much more expensive and won’t look cost-effective to deploy in most situations.
I’d expect that you buy efficiency by building more fine-grained configuration mechanisms and requiring that employees use them. This will also become easier with AI, since you can have specialized AI assistants that understand the configuration mechanisms well and can set them for all experiments (unlike the current status quo where all your employees have to be separately trained on the use of the configurations and many of them forget it and just copy configurations from elsewhere).
In general it seems like this should improve security by creating more info and legibility around what is happening (which in turn gives you more affordances to enforce security). Some examples (not necessarily about AI):
- For code that is costly to run (e.g. uses lots of compute, causes physical-world effects, affects production systems), spending labor to run automated bug finding tools that generate lots of false positives, and addressing the results. This directly increases security by having fewer bugs.
- Breaking up big jobs into many smaller ones that are easier to schedule across your compute fleet. This can enable you to isolate small jobs from each other, which reduces the attack surface.
- Requiring detailed analysis and justification before using significant amounts of compute for experiments. This should increase security against rogue internal deployments, as it becomes easier to catch the AI in a lie.
That said, I did think of one example that worsens security—switching from a memory-safe language like Python to one that requires manual memory management like C. Though in this case, it’s that the “security infra” (automatic memory management) itself is too compute-expensive to run, which doesn’t seem like it will apply in the AI case.
Overall my very weak guess is that the net effect will be to increase security. Of course, the part where you build more fine-grained configuration mechanisms is “novel (potentially) security-critical infrastructure”, so this does support your broader point. (When I say it will increase security, this is under the assumption that the novel infra isn’t sabotaged by scheming AIs.)
From the paper:
I agree this is clearly better if you trust your LLRs to be 100% accurate, but ultimately this has to be fit on some training set and then we are depending on it generalizing to deployment. However, that generalization could fail just for normal ML reasons (the deployment distribution is different, the LLR scores were noisily estimated) or because the model is scheming and adapts to your monitoring strategy.
Suppose I want to rule by fiat that for outputs O1 and O2 where O1 gets higher scores than O2, we are guaranteed to be at least as likely to audit O1 as O2. How would you suggest I go about doing that?
After writing the above I remembered you said the word “kernel” and looked in the paper for that, and saw that you did mention this limitation and suggested that practitioners “increase kernel width or enforce monotonicity”. So I guess my question is, how should I go about enforcing monotonicity? I still have to map two scores (m1, m2) to a decision of whether or not to audit; it’s unclear to me that I can brute force that problem. It’s much easier to brute force when I have a single score (the LLR).
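(For contrast, here is the kind of brute force I have in mind for the single-score case; a sketch on synthetic data, not the paper’s method. Auditing iff the score clears a threshold is monotone by construction.)

```python
import numpy as np

# Sketch only: pick an audit threshold for a single score on synthetic data.
# "Audit iff score >= threshold" is monotone by construction: a higher-scoring
# output is always at least as likely to be audited.

def fit_threshold(scores, audit_budget_fraction):
    scores = np.sort(scores)
    k = int(len(scores) * audit_budget_fraction)  # how many outputs we can afford to audit
    return scores[-k] if k > 0 else float("inf")

rng = np.random.default_rng(0)
val_scores = rng.normal(size=10_000)              # stand-in for validation LLR scores
threshold = fit_threshold(val_scores, audit_budget_fraction=0.01)

def should_audit(score):
    return score >= threshold
```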