As of 2025, I still find this post (and comments) helpful for understanding different things people mean by “CoT unfaithfulness” and why these things still largely aren’t reasons to give up on CoT monitoring for current models.
Alexa Pan
I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
What do we expect to be the best kind of evidence for convincing developers to shutdown/undeploy?
Maybe there is more compelling evidence than “your AI writing code to escape”
I’m reminded of hearing that the public reception of the OAI/Apollo scheming paper was surprising relative to some people’s expectations (people were more freaked out by weird CoT, and less freaked out by actual scheme-y behavior).
What are good plans for AI labs when they have such evidence?
What are good plans for AIS researchers/advocacy people when they have such evidence?
Should you pursue safety research agenda X if shutdown/undeploy is hard?
To spell out the thinking here: you might conceptualize safety research as having broadly two paths to impact:
1- “Training against”: give training signal we can optimize over to reduce egregious misalignment
2- “Shutdown”: give evidence to convince AI developers to slow down or un-deploy
One implication of thinking that 2 is very hard is that your research should either focus aggressively on 1 or risk not being very useful.
I think this post does a good job of arguing that preventing collusion is an important open problem for control / untrusted monitoring in particular and suggesting strategies for this.
I weakly wish this post were broken down into smaller chunks which separately explain “a basic design for untrusted monitoring” and collusion-specific analysis.
It’s perhaps important to note that this post conservatively assumes that an untrusted monitor will consistently try to collude with the policy when “...the untrusted model you’re using as a monitor is very similar to the model you’re using as the policy, e.g. they might be different fine-tunes of models that have already been pre- and post- trained. In principle, you could use different models”. It’d be nice to conceptually and empirically investigate the extent to which using relevantly similar models leads to collusion.
Using different/more decorrelated models, or “mixed deployments”, could be a useful and practical way to prevent collusion
This post might be worth revisiting for experiment ideas when models seem capable enough to exhibit interesting collusion behavior.
I think the high-level point in this post is correct and important, though not super action-guiding.
I find it interesting to notice in 2025 that this post failed to mention the “obvious” social failure mode of extreme power concentration/power grabs/coups. Seems like a missed opportunity for this particular reflection.
This post summarizes the basic structure of high-stakes control evaluations (“win/continue/lose scenarios”) and the building blocks of blue team strategies (“E/R/A protocols”).
I found this post quite helpful for understanding control evals. I somewhat strongly recommend people new to control read this post before reading high-stakes control papers, e.g. the original control paper and Ctrl-Z.
Nit: It’s not totally clear what counts as a control “protocol”. According to the OG post on control, this is “a proposed plan for training, evaluation, and deployment (which we’ll refer to as a whole as a protocol)”. But E/R/A protocols exclude training (including online training in deployment), which we might want to do for the untrusted policy and/or monitor. It might be nice to reserve the word “protocol” for “everything we might want to do over the lifetime of a model possibly able to cause concentrated failures” and have a different word for “a set of things we typically do together in deployment”.
This post defined control as a class of safety techniques, argued for substantially investing in it, and proposed some actions this could imply. I think this is an incredibly useful type of post: I would like to see “the case for x” for many more safety agendas.
1.5 years later, I think this post remains ~the best introduction to control if one already has fair context on technical safety. Some key claims that have aged well (not a comprehensive summary):
Control is a distinct/complementary approach to alignment, and aims to robustly reduce the probability of catastrophic outcomes from worst-case schemers
Control research mostly comprises technique iteration and safety evaluations
Control evaluations are fundamentally dangerous capability evaluations which we can expect to be informative up to a certain capability regime
Control is likely achievable for early “transformatively useful” AIs (AIs capable of speeding up AI R&D and other helpful tasks by ~30x)
Rough taxonomy of control techniques: high quality, expensive oversight; fast, cheap automated oversight; adversarial measures against untrusted AIs, e.g. collusion busting, behavioral red teaming; general security measures; training.
Taxonomy of trusted vs. untrusted AIs, where we rule-in trust by ruling out specific levels of scheming capabilities.
One crux for the value of control remains “will there be sufficient time between ‘transformatively useful’ and ‘uncontrollable’ capability regimes”. (At least, I haven’t made strong updates in either direction.)
Some ways in which I disagree with this post/feel this post might have become less representative of control over time:
Nit: The post seems to use “scheming” in the sense of “worst case schemers”. As discussed elsewhere, this seems good to make explicit to disambiguate from other senses of scheming, e.g. power-motivated instrumental training-gaming.
The case for focusing on black-box control
I’m somewhat excited about white-box control (e.g. coup probes, weight interp, steering). I also have the sense that white-box techniques may be more back on the table than when this post was written, though I’m not sure. A related shift in this direction is interest in “non-conservative control”, which includes low-assurance techniques cheap enough to just add to the toolkit.
“We think we probably won’t run into issues related to qualitatively wildly superhuman skills in problematic domains”, e.g. hacking, persuasion
My sense is that this claim about the necessary capability profile for control has become less load-bearing over time (perhaps because there’s reason to believe that untrusted monitoring might just work out fine even for, e.g., superhuman hackers?)
Also, diffuse/low-stakes control focuses on the threat model where AIs might take “normal/innocent-looking” bad actions which are individually hard to catch/incriminate, but where we can withstand a large number of such bad actions. Measures against this threat model might address some worries with oversight here.
This post doesn’t elaborate much on the ways AIs might intentionally subvert safety measures, e.g. research sabotage.
There seems to have been more work on training-based control than is represented in this post.
I think it would be cool to have a more in-depth retrospective of ways in which control has developed/changed since this post. I won’t attempt to do that here.
Thanks, I agree with you here! Do you think that’s an assumption (about what misalignment looks like) we should generally work more with?
Supposing it is, I’m not immediately sure how this would change the rest of the argument. I guess considerations for “should we intervene on persona A’s knowledge of persona B” could turn out to be quite different, e.g. one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.
Oops, fixed
Will misaligned AIs know that they’re misaligned?
What would an IRB-like policy for AI experiments look like?
Thanks for flagging, it’s interesting to know that this idea has considerable traction. Maybe we should have responded in more detail given this.
Our response was mainly the following: If models can’t confidently distinguish test and deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but it doesn’t provide clear warning signs or examples of misbehavior in the wild to study.
Also, this doesn’t seem like a scalable solution for more capable models, which will probably distinguish test from deployment at some point.
This post highlighted ways in which OAI’s CBRN evals were uninformative about how close o1-preview was to action-relevant thresholds. I think it’s valuable for increasing public knowledge about model capabilities/risk profiles and keeping developers accountable.
I particularly appreciate the post breaking down the logic of rule-out evals (“is this test clearly easier than [real world threat model]” and “does the model clearly fail this test”). This frame still seems useful for assessing system cards in 2025.