We are open to feedback that might convince us to focus on these directions instead of monitorability.
What is the theory of impact for monitorability? It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.
Since monitoring is pretty superficial, it doesn’t give you a signal which you can use to optimize the model directly (that would be The Most Forbidden Technique) in the limit of superintelligence. My take on monitoring is that at best it allows you to sound an alarm if you catch your AI doing something bad.
So then what’s the ToI for sounding an alarm? Maybe some people are convinced, but unless your alarm is so convincing (and bare in mind that lots of people will have large financial incentives to remain unconvinced) that you either convince every frontier lab employee, or some very influential politicians, it’s useless. Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn’t care as much comes along and builds the AI that kills us anyway.
I think most ideas surrounding “control” will cash out to optimizing the AI against the monitor in one way or another (ex: suppose we stop suspicious command line calls from being executed. If you’re using an AI’s successful outputs for RL, you’re then implicitly optimizing for calls which bypass the monitor, even if the stopped calls aren’t used as negative examples) even if they don’t seem that way.
Our ToI includes a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and c) lowering the monitorability tax associated with interventions in b. Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.
It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.
There are certainly some senses in which CoT monitoring is weaker than mech interp: all of a model’s cognition must happen somewhere in its internals, and there’s no guarantee that all the relevant cognition appears in CoT. On the other hand, there are also important senses in which CoT monitoring is a stronger technique. When a model does a lot of CoT to solve a math problem, reading the CoT provides useful insight into how it solved the problem, and trying to figure that out from model internals alone seems much more complicated. We think it’s quite likely this transfers to safety-relevant settings like a model figuring out how to exfiltrate its own weights or subvert security measures.
We agree that directly training against monitors is generally a bad idea. We’re uncertain about whether it’s ever fine to optimize reasoning chains to be more readable (which need not require training against a monitor), though it seems plausible that there are techniques that can enhance monitorability without promoting obfuscation (see Part 2 of our agenda). More confidently, we would like frontier labs to adopt standardized monitorability evals that are used before deploying models internally and externally. The results of these evals should go in model cards and can help track whether models are monitorable.
in the limit of superintelligence
Our primary aim is to make ~human-level AIs more monitorable and trustworthy, as we believe that more trustworthy early TAIs make it more realistic that alignment research can be safely accelerated/automated. Ideally, even superintelligent AIs would have legible CoTs, but that’s not what we’re betting on with the agenda.
Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn’t care as much comes along and builds the AI that kills us anyway.
It’s fair to be concerned that knowing models are becoming unmonitorable may not buy us much, but we think this is a bit too pessimistic. It seems like frontier labs are serious about preserving monitorability (e.g. OpenAI), and maybe there are some careful interventions that actually work to improve monitorability! Parts of our agenda are aimed at studying what kinds of interventions degrade vs enhance monitorability.
What is the theory of impact for monitorability? It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.
Since monitoring is pretty superficial, it doesn’t give you a signal which you can use to optimize the model directly (that would be The Most Forbidden Technique) in the limit of superintelligence. My take on monitoring is that at best it allows you to sound an alarm if you catch your AI doing something bad.
So then what’s the ToI for sounding an alarm? Maybe some people are convinced, but unless your alarm is so convincing (and bare in mind that lots of people will have large financial incentives to remain unconvinced) that you either convince every frontier lab employee, or some very influential politicians, it’s useless. Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn’t care as much comes along and builds the AI that kills us anyway.
I think most ideas surrounding “control” will cash out to optimizing the AI against the monitor in one way or another (ex: suppose we stop suspicious command line calls from being executed. If you’re using an AI’s successful outputs for RL, you’re then implicitly optimizing for calls which bypass the monitor, even if the stopped calls aren’t used as negative examples) even if they don’t seem that way.
Our ToI includes a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and c) lowering the monitorability tax associated with interventions in b. Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.
There are certainly some senses in which CoT monitoring is weaker than mech interp: all of a model’s cognition must happen somewhere in its internals, and there’s no guarantee that all the relevant cognition appears in CoT. On the other hand, there are also important senses in which CoT monitoring is a stronger technique. When a model does a lot of CoT to solve a math problem, reading the CoT provides useful insight into how it solved the problem, and trying to figure that out from model internals alone seems much more complicated. We think it’s quite likely this transfers to safety-relevant settings like a model figuring out how to exfiltrate its own weights or subvert security measures.
We agree that directly training against monitors is generally a bad idea. We’re uncertain about whether it’s ever fine to optimize reasoning chains to be more readable (which need not require training against a monitor), though it seems plausible that there are techniques that can enhance monitorability without promoting obfuscation (see Part 2 of our agenda). More confidently, we would like frontier labs to adopt standardized monitorability evals that are used before deploying models internally and externally. The results of these evals should go in model cards and can help track whether models are monitorable.
Our primary aim is to make ~human-level AIs more monitorable and trustworthy, as we believe that more trustworthy early TAIs make it more realistic that alignment research can be safely accelerated/automated. Ideally, even superintelligent AIs would have legible CoTs, but that’s not what we’re betting on with the agenda.
It’s fair to be concerned that knowing models are becoming unmonitorable may not buy us much, but we think this is a bit too pessimistic. It seems like frontier labs are serious about preserving monitorability (e.g. OpenAI), and maybe there are some careful interventions that actually work to improve monitorability! Parts of our agenda are aimed at studying what kinds of interventions degrade vs enhance monitorability.