In 2014, we proposed that researchers try to find ways to make highly capable AIs “corrigible,” or “able to be corrected.” The idea would be to build AIs in such a way that they reliably want to *help* and cooperate with their programmers, rather than hinder them — even as they become smarter and more powerful, and even though they aren’t yet perfectly aligned.
Minimally, corrigibility just means that present goals can be changed. There could be crude, brute-force ways of doing that, like Pavlovian conditioning, or overwriting an explicitly coded utility function.
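As a toy illustration of that minimal, brute-force sense (a sketch only; the `Agent` class and both utility functions are hypothetical, not anything from the discussion):

```python
from typing import Callable

# State -> score; an explicitly coded utility function the overseer can swap out.
UtilityFn = Callable[[dict], float]

class Agent:
    """Toy agent that is 'corrigible' only in the crude, minimal sense:
    its utility function is an explicit object that can be overwritten."""

    def __init__(self, utility: UtilityFn):
        self.utility = utility

    def overwrite_utility(self, new_utility: UtilityFn) -> None:
        # Brute-force correction: replace the coded utility function wholesale.
        self.utility = new_utility

    def choose(self, options: list[dict]) -> dict:
        # Pick whichever option the *current* utility function scores highest.
        return max(options, key=self.utility)

agent = Agent(utility=lambda s: s.get("paperclips", 0))
agent.overwrite_utility(lambda s: s.get("human_approval", 0))  # goals changed
```

Nothing in this sketch makes the agent *want* to permit the overwrite, which is the stronger property the 2014 proposal is after.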
The whole point of corrigibility is to scale to novel contexts and new capability regimes.
That can be done gradually, provided capability gains are themselves gradual.
The reason that MIRI wasn’t able to make corrigibility work is that corrigibility is basically a silly thing to want. I can’t think of any system in the (large) human world that needs perfectly corrigible parts, i.e. humans whose motivations can be arbitrarily reprogrammed.
That’s not so much an argument against corrigibility as an argument against perfect corrigibility.