By “control plausibly works” I didn’t mean “Stuff like existing monitoring will work to control AIs forever”. I meant it works if it is a stepping stone that allows us to accelerate/finish alignment research, and thereby build aligned AGI.
This fits with my model of what most control researchers believe, where the main point of disagreement is on what “accelerate/finish alignment research” entails. I think it’s important that the disagreement isn’t actually about control, but about the much more complicated topic of alignment given control.
Edited from here on for tone and clarity:
I think it is unsurprising that people who work on control disagree with others about whether control allows you to finish alignment research, since this is mostly a question of “Given X affordances, could we solve alignment?” That is fundamentally a question about alignment, which is a very difficult topic to agree on. Many of the smart people who believe in control are currently working on it, and thus are not currently working on alignment, so their intuitions about what is required to solve alignment are likely to be in tension with the intuitions of those who are working on alignment.
(There are also some disagreements regarding FOOM and other capability discontinuities, since a FOOM most likely causes control to break down. The likelihood of FOOM is something I am still confused about, so I don’t know which side is actually reasonable here. My current position is that if you have an alignment method which actually looks like it will work for AIs that grow at <10x the speed of the current capabilities exponential, that would be brilliant and definitely worth trying all on its own, even if you can’t adapt it to faster-growing AIs.)
I’m using “alignment” here pretty loosely, but basically I mean most of the stuff on the agent foundations <-> mech interp spectrum/region/cloud. I specifically exclude control research from this because I haven’t seen anyone reasonable claim that control → more control → even more control is a good idea (the closest I’ve seen is William MacAskill saying something along those lines, but even he thinks we need some alignment).