I think there are reasonable people who look at the evidence and think it plausible that control works, and also reasonable people who look at the evidence and think it implausible that control works. And others who think that openai-superalignment-style plans plausibly work. Something is going wrong here.
Minor aside, but I’m not sure I’ve ever heard someone reasonable (e.g. a Redwood employee) say “Control Works” in the sense of “Stuff like existing monitoring will work to control AIs forever”. I’ve only ever heard control talked about as a stopgap for a fairly narrow set of ~human capabilities, which allows us to *something something* solve alignment. It’s the second part that seems to be where the major disagreements come in.
I’m not sure about superalignment since they published one paper and it wasn’t even really alignment, and then all the reasonable ones I can think of got fired.
By “control plausibly works” I didn’t mean “Stuff like existing monitoring will work to control AIs forever”. I meant it works if it is a stepping stone that allows us to accelerate/finish alignment research, and thereby build aligned AGI.
This fits with my model of what most control researchers believe, where the main point of disagreement is on what “accelerate/finish alignment research” entails. I think it’s important that the disagreement isn’t actually about control, but about the much more complicated topic of alignment given control.
Edited from here on for tone and clarity:
I think it is unsurprising for the people who work on control to disagree with other people on whether control allows you to finish alignment research, since this is mostly a question of “Given X affordances, could we solve alignment?” This is fundamentally a question about alignment, which is a very difficult topic to agree on. Many of the smart people who believe in control are currently working on it, and thus are not currently working on alignment, so their intuitions about what is required to solve alignment are likely to be in tension with the intuitions of those who are working on alignment.
(There are also some disagreements regarding FOOM and other capability discontinuities, since a FOOM most likely causes control to break down. The likelihood of FOOM is something I am still confused about, so I don’t know which side is actually reasonable here. My current position is that if you have an alignment method which actually looks like it will work for AIs that grow at <10x the speed of the current capabilities exponential, that would be brilliant and definitely worth trying all on its own, even if you can’t adapt it to faster-growing AIs.)
I’m using “alignment” here pretty loosely, but basically I mean most of the stuff on the agent foundations <-> mech interp spectrum/region/cloud. I specifically exclude control research from this because I haven’t seen anyone reasonable claim that control → more control → even more control is a good idea (the closest I’ve seen is William MacAskill saying something along those lines, but even he thinks we need some alignment).
Do you think it’s similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I’d guess the former, but low confidence.)
I think several of the subquestions that matter for whether it’ll plausibly work to have AI solve alignment for us are in the second category. Like the two points I mentioned in the post. I think there are other subquestions that are more in the first category, which are also relevant to the odds of success. I’m relatively low confidence about this kind of stuff because of all the normal reasons why it’s difficult to say how other people should be thinking. It’s easy to miss relevant priors, evidence, etc. But still… given what I know about what everyone believes, it looks like these questions should be resolvable among reasonable people.
The more boring (and likely) case is that we just have too few data points to tell whether AI control can actually work as it’s supposed to, so we have to mostly fall back on priors.
While I’m making this comment, I’ll also flag something from J Bostock’s comment:
I’ve only ever heard control talked about as a stopgap for a fairly narrow set of ~human capabilities, which allows us to something something solve alignment.
The human range of capabilities is actually quite large (discussed in SSC).