Strong upvote, I would also love to see more discussion of the difficulty of inner alignment.
> which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
> … seem like world models that make sense to me, given the surrounding justifications
FWIW, I don’t really understand those world models/intuitions yet:
Re: “earlier patches not generalising as well as the deep algorithms”—I don’t understand/am sceptical about the abstraction of “earlier patches” vs. “deep algorithms learned as intelligence is scaled up”. What gets dubbed “patches that won’t generalise well” looks to me more like “plausibly successful shaping of the model’s goals”. I don’t see why, at some point when the model gets sufficiently smart, gradient descent will get it to throw out the goals it used to have. What am I missing?
Re: corrigibility being “anti-natural” in a certain sense—I think I just don’t understand this at all. Has it been discussed clearly anywhere else?
(jtbc, I think inner misalignment might be a big problem, I just haven’t seen any good argument for it plausibly being the main problem)
Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, an action that prevents the world from being destroyed by any other future AGI)
AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
(1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
This seems unnatural and Eliezer can’t currently see how to do it; the toy sketch below gestures at the tension.
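To make the tension concrete, here is a toy sketch, entirely my own construction rather than anything Eliezer has written: a planner that literally “searches paths through time and selects high-scoring ones for output”, in a world where a human may press a shutdown button. All the specifics (the two-step world, the 50% press probability, the action names) are made-up assumptions for illustration:

```python
# Toy model (my construction): a "consequentialist" planner that enumerates
# two-step plans and picks the one with the highest expected goal-score.
# With probability P_PRESS a human presses a shutdown button after step one;
# a shut-down agent never finishes its task and scores 0. So any plan that
# first disables the button strictly dominates, and the planner resists
# correction without that ever being an explicit goal.

from itertools import product

P_PRESS = 0.5        # assumed chance the human tries to shut the agent down
GOAL_REWARD = 10.0   # reward for finishing the task (at step two)

ACTIONS = ["disable_button", "leave_button", "work"]

def expected_score(plan):
    """Expected goal-reward of a two-step plan, marginalising over
    whether the human presses the (still-functional) button after step one."""
    first, second = plan
    button_live = first != "disable_button"
    p_shutdown = P_PRESS if button_live else 0.0
    finishes = second == "work"
    return (1 - p_shutdown) * (GOAL_REWARD if finishes else 0.0)

# Search paths through time; select the high-scoring one for output.
best = max(product(ACTIONS, repeat=2), key=expected_score)
print(best, expected_score(best))   # ('disable_button', 'work') 10.0
```

Corrigibility would mean the agent is indifferent between the pressed and unpressed branches, so that disabling the button buys it nothing; but naive patches to the score function tend to just move the problem around, which (as I understand it) is roughly what “anti-natural” is pointing at.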
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either be mistaken in thinking they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, at which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
(This is partly based on this summary)
> or honestly panic about not having achieved it and halt, at which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world
Note that this is still better than ‘honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don’t need to halt’!