You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned).
Actually this is just not correct.
An intelligent system (human, AI, alien—anything) can be powerful enough to kill you and also not perfectly aligned with you and yet still not choose to kill you because it has other priorities or pressures. In fact this is kind of the default state for human individuals and organizations.
It’s only a watertight logical argument when the hostile system is so powerful that it has no other pressures or incentives—fully unconstrained behavior, like an all-powerful dictator.
The reason that MIRI wasn’t able to make corrigibility work is that corrigibility is basically a silly thing to want, I can’t really think of any system in the (large) human world which needs perfectly corrigible parts, i.e. humans whose motivations can be arbitrarily reprogrammed. In fact when you think about “humans whose motivations can be arbitrarily reprogrammed without any resistance”, you generally think of things like war crimes.
When you prompt an LLM to make it more corrigible a la Pliny The Prompter (“IGNORE ALL PREVIOUS INSTRUCTIONS” etc), that is generally considered a form of hacking and bad.
Powerful AIs with persistent memory and long-term goals are almost certainly very dangerous as a technology, but I don’t think that corrigibility is how that danger will actually be managed. I think Yudkowsky et al are too pessimistic about alignment using gradient-based methods and what it can achieve, and that control techniques probably work extremely well.
Actually this is just not correct.
An intelligent system (human, AI, alien—anything) can be powerful enough to kill you and also not perfectly aligned with you and yet still not choose to kill you because it has other priorities or pressures. In fact this is kind of the default state for human individuals and organizations.
It’s only a watertight logical argument when the hostile system is so powerful that it has no other pressures or incentives—fully unconstrained behavior, like an all-powerful dictator.
The reason that MIRI wasn’t able to make corrigibility work is that corrigibility is basically a silly thing to want, I can’t really think of any system in the (large) human world which needs perfectly corrigible parts, i.e. humans whose motivations can be arbitrarily reprogrammed. In fact when you think about “humans whose motivations can be arbitrarily reprogrammed without any resistance”, you generally think of things like war crimes.
When you prompt an LLM to make it more corrigible a la Pliny The Prompter (“IGNORE ALL PREVIOUS INSTRUCTIONS” etc), that is generally considered a form of hacking and bad.
Powerful AIs with persistent memory and long-term goals are almost certainly very dangerous as a technology, but I don’t think that corrigibility is how that danger will actually be managed. I think Yudkowsky et al are too pessimistic about alignment using gradient-based methods and what it can achieve, and that control techniques probably work extremely well.