Sam Clarke comments on Comments on Carlsmith’s “Is power-seeking AI an existential risk?”

Sam Clarke 23 Feb 2022 13:12 UTC
LW: 11 AF: 6
Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
- Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, an act that prevents the world from being destroyed by any other future AGI).
- An AI system that is capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”.
- Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
  - (1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
  - but also (2) is happy to accept certain human-initiated attempts to change that target (modifying its goals, shutting it down, etc.)
- This seems unnatural and Eliezer can’t see how to do it currently.
- An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either mistakenly believe they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world.
- (This is partly based on this summary)
- Rob Bensinger 23 Feb 2022 22:34 UTC
  LW: 5 AF: 3
  > or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world
  Note that this is still better than ‘honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don’t need to halt’!