The 5 main reasons why control could fail
Here are the 5 main categories of reasons why control could fail (though various of them are often correlated).
[Note: this document has a moderate amount of overlap with various prior documents Redwood has released, but it seemed like a useful document nonetheless.]
1. Insufficient (long-horizon) sample efficiency: If your AI is scheming, you won’t be able to get it to achieve near-maximal reward on average on (long-horizon) tasks with reasonable sample efficiency. By near-maximal, we just mean empirically achieving average reward which is quite close to the reward the AI would get if it was “actually trying” to get high reward (insofar as it is scheming to an extent that “trying” makes sense). (When we say average, we actually mean average excluding rare outliers. We can’t control behavior on rare outliers using normal ML training.) For control, this causes issues in two places: it interferes with getting useful work out of AIs and with running accurate capability evaluations.
2. Auditing failures: Inability of humans to sufficiently assess actions. (Though note there can be solutions to these failures which don’t involve improving auditing.) This includes:
  a. Average case usefulness: For scheming AIs, you won’t be able to check that the work the AI is doing is actually good (on average) rather than just looking good to you. Note that this includes getting enough actually good work out of AIs.
  b. Average case safety: Direct safety issues that occur when auditing fails to notice a large number of actively dangerous actions, and these actions then build up into a catastrophic outcome. (E.g., the AI slowly persuades some group in the AI lab to become AI cultists.) One particularly salient special case is:
    i. We can’t prevent AIs from building up huge amounts of coup science: Control will depend on human researchers being similarly good or better than the AIs at understanding our countermeasures and the space of possible AI attack strategies. See Notes on control evaluations for safety cases for more discussion.
  c. High-stakes safety: The auditing process fails to detect that a single action is very bad when it reviews that action, and this is extremely bad because that action is part of a high-stakes failure (e.g., an escape attempt that occurs in a small number of actions).
3. Concentrated failures: Inability to prevent concentrated catastrophes (safety failures which are concentrated into a small number of bad actions or a small amount of serial time) despite having an expensive and slow auditing process which could detect this bad behavior if it was assigned to review these actions (that is, the process works for high-stakes safety).
4. Lack of institutional will or ability: Even if control would work given a few years of delay and an implementation carried out by a quite well-run organization, you won’t get one of these in practice and thus you’ll have trouble. (This argument also applies to any other technical approach whose implementation requires substantial institutional will. Therefore, we need some sort of technical solutions which require less institutional competence.)
5. Catching the AI isn’t sufficiently useful: The viability of control will likely depend on it being very useful to catch the AI one or a small number of times. But this might turn out to be false for various reasons. See Catching AIs red-handed for discussion. This might depend heavily on questions which aren’t about technical AI safety, like how useful legible evidence of serious misalignment is. But the technical usefulness of catching could suffice.
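To make the contrast between concentrated failures and average case failures concrete, here is a minimal sketch in Python. It rests on assumptions that are ours for illustration and not part of the analysis above: each action is independently selected for audit with a fixed probability, and an audited bad action is always recognized as bad. The function name and the numbers are hypothetical.

```python
def catch_probability(audit_fraction: float, num_bad_actions: int) -> float:
    """Chance that at least one bad action gets audited.

    Illustrative assumptions only: each action is independently chosen
    for audit with probability `audit_fraction`, and any audited bad
    action is correctly recognized as bad.
    """
    if not 0.0 <= audit_fraction <= 1.0:
        raise ValueError("audit_fraction must be in [0, 1]")
    if num_bad_actions < 0:
        raise ValueError("num_bad_actions must be non-negative")
    # P(no bad action is audited) = (1 - audit_fraction) ** num_bad_actions
    return 1.0 - (1.0 - audit_fraction) ** num_bad_actions


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real deployment.
    for k in (1, 5, 100, 1000):
        p = catch_probability(0.01, k)
        print(f"{k:>5} bad actions, 1% of actions audited: P(catch) = {p:.3f}")
```

Under these toy assumptions, a failure that needs only a handful of bad actions is very likely to slip past an audit process that reviews a small fraction of actions, even if that process would reliably flag the actions when shown them. A failure that needs many bad actions is very likely to be sampled at least once, so whether it is caught depends on whether the audit recognizes the action as bad.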
We think that various of these things could go wrong in practice. However, it seems like the current point on the logistic success curve is such that if an AI lab tried very hard to evaluate and ensure control, they could improve the situation substantially (e.g., 3-5x reduction in doom) relative to not trying very hard.
Note that 2, 3, and 4 of the above failures probably just prevent us from building sufficiently safe and useful protocols without breaking control evaluations. 1 and 2.b.i (coup science build-up) are likely to break control evaluations, though it might be possible to recognize that we’re in a world where 1 and/or 2.b.i hold. 5 has more complex effects, but we’d probably be able to safely stop after realizing that we’re unable to extract enough value from the times when we’ve caught the AI; so the main concern would be that we’re somewhat clearly unable to safely continue rather than being caught dangerously unaware.
See here for more discussion.
Here are some reasons why control might not be worth working on (or actively bad to work on) regardless of the above reasons:
Control is bad for coordination: Perhaps control does improve safety, but it does so at the cost of making it easier to make excuses, giving false confidence, and/or creating slippery slopes.
Control makes it cheaper and easier to do morally problematic things: Control might make it more commercially valuable and technically feasible to employ AIs in situations where we know these AIs don’t consent to this usage (and these AIs are well described as having preferences that deserve moral consideration), and this might make violations of basic moral principles (such as slavery) more likely. (This is analogous to how better technology for managing human insider threats could make human slavery more commercially valuable. Of course, in the case of human slavery, there are already institutions and norms which combat slavery.) However, note that our recommendation to society would be to try extremely hard to avoid employing AIs without their consent; if AIs don’t consent to employment, we should (temporarily) shut down or negotiate a deal with the AIs. Much more commentary on these topics can be seen here. Our overall view is that the default AI situation with only current alignment/safety/interpretability technology is somewhat scary from an AI welfare/rights perspective, but that control doesn’t make the situation worse.
Scheming is extremely unlikely for early transformatively useful AI (<1%) and control-style approaches won’t be a useful safeguard against other threat models: Control focuses on making conservative assumptions about the effects of training and the motivations of AIs. If we didn’t have to make such conservative assumptions, then control-type approaches would look much less worthwhile. However, control-type approaches might still be a useful line of defense against failure modes which aren’t well described as scheming.
Baseline control strategies will work (when control is at all possible) and control evaluations are easy to run: If the baseline approaches can’t be improved on and there isn’t important research on the evaluations, then technical work on control isn’t very important. Policy/governance work might still be important. See also here.