This post defined control as a class of safety techniques, argued for substantially investing in it, and proposed some actions this could imply. I think this is an incredibly useful type of post: I would like to see “the case for x” for many more safety agendas.
1.5 years later, I think this post remains ~the best introduction to control if one already has fair context on technical safety. Some key claims that have aged well (not a comprehensive summary):
Control is an approach distinct from and complementary to alignment, and aims to robustly reduce the probability of catastrophic outcomes from worst-case schemers
Control research mostly comprises technique iteration and safety evaluations
Control evaluations are fundamentally dangerous capability evaluations which we can expect to be informative up to a certain capability regime
Control is likely achievable for early “transformatively useful” AIs (AIs capable of speeding up AI R&D and other helpful tasks by ~30x)
Rough taxonomy of control techniques: high quality, expensive oversight; fast, cheap automated oversight; adversarial measures against untrusted AIs, e.g. collusion busting, behavioral red teaming; general security measures; training.
Taxonomy of trusted vs. untrusted AIs, where we rule in trust by ruling out specific levels of scheming capabilities.
One crux for the value of control remains “will there be sufficient time between ‘transformatively useful’ and ‘uncontrollable’ capability regimes”. (At least, I haven’t made strong updates in either direction.)
Some ways in which I disagree with this post/feel this post might have become less representative of control over time:
Nit: The post seems to use “scheming” in the sense of “worst case schemers”. As discussed elsewhere, it seems good to make this explicit, to disambiguate it from other senses of scheming, e.g. power-motivated instrumental training-gaming.
The case for focusing on black-box control
I’m somewhat excited about white-box control (e.g. coup probes, weight interp, steering). I also sense that white-box techniques may be more back on the table than when this post was written, though I’m not sure. A more general shift in this direction might be the interest in “non-conservative control”, which includes low-assurance techniques that are cheap enough to just add to the toolkit.
“We think we probably won’t run into issues related to qualitatively wildly superhuman skills in problematic domains”, e.g. hacking, persuasion
My sense is that this claim about the necessary capability profile for control has become less load-bearing over time (perhaps because there’s reason to believe that untrusted monitoring might just work out fine even for e.g. superhuman hackers?).
Also, diffuse/low-stakes control focuses on the threat model where AIs might take “normal/innocent looking” bad actions which are hard to catch/incriminate, but where we can withstand a large number of such bad actions. Measures against this threat model might address some worries with oversight here.
This post doesn’t elaborate much on ways in which one might model AIs intentionally subverting safety measures, e.g. research sabotage.
There seems to have been more work on training-based control than is represented in this post.
I think it would be cool to have a more in-depth retrospective of ways in which control has developed/changed since this post. I won’t attempt to do that here.