Thanks John, I think that this is a reasonable articulation of the main limitations of AI control as I’ve described it. But I find it very unconvincing as an argument that AI control should be less than, say, 20% of total research effort on safety.
My central crux is that preventing scheming early transformative AIs from succeeding at extreme subversion (e.g. escaping or massively sabotaging research due to scheming) seems quite robustly good for reducing AI risk. I think this problem is important enough that it deserves a substantial share of AI safety research and implementation effort. I think that my stance here is pretty common-sensical, and to some extent I feel like there’s a burden of proof on skeptics to argue that scheming early transformative AI isn’t a problem, i.e. that we can safely ignore anything that goes wrong as a result of them intentionally trying to subvert us.
I haven’t really justified a quantitative estimate of the importance of this in detail anywhere, both because I think it’s pretty intuitively defensible, and also because it’s pretty complicated to talk about: it requires forecasting detailed dynamics of the development of powerful AI (e.g. you have to argue about whether escaped scheming AIs might force AI developers to build uncontrollable superintelligences in cases where they otherwise wouldn’t have, etc).
Note that I consider the problem of “get useful research out of AIs that are scheming against you” to be in scope for AI control research. We’ve mostly studied the “prevent scheming AIs from causing concentrated catastrophes” problem in the past, because it’s simpler for various technical reasons. But we’re hoping to do more on the “get useful research” problem in future (and Benton et al is an initial step in this direction). (I’m also excited for work on “get useful research out of AIs that aren’t scheming against you”; I think that the appropriate techniques are somewhat different depending on whether the models are scheming or not, which is why I suggest studying them separately.)
Some other nitpicks:
At the risk of vagueness: I think that AI control techniques probably reduce risks when applied to models that are “moderately superintelligent”, whatever that means. Like, I don’t think they’re literally only applicable to models that aren’t superintelligent at all.
As we’ve discussed before, some non-scheming threat models do seem like they’d be addressed by control as a special case.
What would you say to the objection that people will immediately try to use such techniques to speed up ASI research just as much as they will try to use them to speed up alignment research, if not more? In that case they wouldn’t help close the gap between alignment research and ASI development, and might even make it grow faster.
If we were not expecting to solve the alignment problem before ASI is developed in a world where nobody knows how to get very useful research out of AIs, why would techniques for getting useful research out of AIs speed up alignment research more than ASI research and thereby close the gap?
I think this is a pretty general counterargument to any intervention that increases the space of options that AI developers have during the early singularity. Which isn’t to say that it’s wrong.
In general, I think that it is a pretty good bet to develop safety methods that improve the safety-usefulness tradeoffs available when using AI. I think this looks good in all the different worlds I described here.
I am commenting more on your proposal to solve the “get useful research” problem here than the “get useful research out of AIs that are scheming against you” problem, though I do think this objection applies to both. I can see a world in which misalignment and scheming of early AGI is an actual blocker to their usefulness in research and other domains with sparse feedback, in a very obvious and salient way. In that world, solving the “get useful research out of AIs that are scheming against you” problem further ramps up economic incentives for making smarter AIs.
I think that this is a pretty general counterargument to most game plans for alignment that don’t include a step of “And then we get a pause on ASI development somehow” at some point in the plan.
Note that I don’t consider “get useful research, assuming that the model isn’t scheming” to be an AI control problem.
AI control isn’t a complete game plan and doesn’t claim to be; it’s just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.
I feel like I might be missing something, but conditional on scheming, isn’t it differentially useful for safety, because by default scheming AIs would be more likely to sandbag on safety research than on capabilities research?
That’s not clear to me? Unless they have a plan to ensure future ASIs are aligned with them or meaningfully negotiate with them, ASIs seem just as likely to wipe out any earlier non-superhuman AGIs as they are to wipe out humanity.
I can come up with specific scenarios where they’d be more interested in sabotaging safety research than capabilities research, as well as the reverse, but it’s not evident to me that the combined probability mass of the former outweighs the latter or vice-versa.
If someone has an argument for this I would be interested in reading it.
I found some prior relevant posts and tagged them under https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.
edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because it has clearer metrics.
Note that there are two problems Buck is highlighting here:
1. Get useful work out of scheming models that might try to sabotage this work.
2. Get useful research work out of models which aren’t scheming. (Where perhaps the main problem is in checking their outputs.)
My sense is that work on (1) doesn’t advance ASI research except to the extent that scheming AIs would have tried to sabotage this research? (At least, insofar as work on (1) doesn’t help much with (2) in the long run.)
It’s been about 8 months since this post and Buck’s comment above.
At the time, I didn’t bother replying to Buck’s comment because it didn’t really say much. My post basically said “this control agenda doesn’t seem to address anything important”, illustrated with a slew of specific examples of ways-things-go-wrong which seem-to-me to account for far more probability mass than scheming in early transformative AGI. Buck’s response was basically “yeah, those are real limitations, but IDK man scheming seems intuitively important?”. There’s a substantive argument to be had here about why I expect scheming of early transformative AGI to be either unlikely or easy to fix (relative to other problems), whereas Buck expects the opposite, and we haven’t had that debate.
Anyway, I’m leaving this comment now because I think some people saw that I had a critique, that Buck had responded, and that I hadn’t responded back, and therefore assumed that the ball was in my court and I wasn’t engaging. That’s not my understanding of what’s going on here; I think Buck has basically not argued back substantively against the core of the critique, but also I haven’t argued strongly against his core crux either. We’ve identified a crux, and that’s the state of things. (And to be clear that’s fine, not every debate is worth having.)