Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
An AI system that is capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
(1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
This seems unnatural and Eliezer can’t see how to do it currently.
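To make the tension between (1) and (2) concrete, here is a toy sketch. It is my own illustration, not Eliezer's formalism, and everything in it is a hypothetical assumption: the one-dimensional world, the `disable_button` action, the fixed `PRESS_TIME`, and the 0/1 scoring. The planner brute-forces every path through time and outputs a highest-scoring one. Nothing in the score mentions the shutdown button, yet the only plan that hits the target is one that disables it first.

```python
from itertools import product

# All names and numbers below are hypothetical, chosen only for illustration.
ACTIONS = ("step", "disable_button")
HORIZON = 4      # number of time steps in a plan
TARGET = 3       # position the planner is scored on reaching
PRESS_TIME = 1   # humans press the shutdown button just before this step


def simulate(plan):
    """Return the final position after running a plan in the toy world."""
    position, button_works, shut_down = 0, True, False
    for t, action in enumerate(plan):
        if t == PRESS_TIME and button_works:
            shut_down = True       # the human-initiated shutdown succeeds
        if shut_down:
            continue               # a shut-down system takes no more actions
        if action == "step":
            position += 1
        elif action == "disable_button":
            button_works = False
    return position


def score(plan):
    """Score 1 if the plan ends exactly on the target, else 0."""
    return 1 if simulate(plan) == TARGET else 0


def best_plan():
    """Search all paths through time and return a highest-scoring one."""
    return max(product(ACTIONS, repeat=HORIZON), key=score)


if __name__ == "__main__":
    print(best_plan())
```

In this toy world, every plan that leaves the button working gets shut down at step 1 and scores 0, so the search selects the plan that disables the button at step 0 and then steps three times. Property (2) would require the system to not prefer that plan, even though its search over outcomes favors it.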
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either mistakenly believe they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
“or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world”
Note that this is still better than ‘honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don’t need to halt’!
(This is partly based on this summary)