Strong upvote, I would also love to see more discussion of the difficulty of inner alignment.
> which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
> … seem like world models that make sense to me, given the surrounding justifications
FWIW, I don’t really understand those world models/intuitions yet:
Re: “earlier patches not generalising as well as the deep algorithms”—I don’t understand/am sceptical about the abstraction of “earlier patches” vs. “deep algorithms learned as intelligence is scaled up”. What gets dubbed “patches that won’t generalise well” looks to me more like “plausibly successful shaping of the model’s goals”. I don’t see why, at some point when the model gets sufficiently smart, gradient descent will get it to throw out the goals it used to have. What am I missing?
Re: corrigibility being “anti-natural” in a certain sense—I think I just don’t understand this at all. Has it been discussed clearly anywhere else?
(jtbc, I think inner misalignment might be a big problem, I just haven’t seen any good argument for it plausibly being the main problem)
Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, an action that prevents the world from being destroyed by any other future AGI)
AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
(1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
This seems unnatural and Eliezer can’t currently see how to do it; the toy sketch below gestures at the tension.
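To make the tension concrete, here is a toy sketch, entirely my own construction rather than anything Eliezer has written: a planner that literally “searches paths through time and selects high-scoring ones for output”, in a world where a human may press a shutdown button. All the specifics (the two-step world, the 50% press probability, the action names) are made-up assumptions for illustration:

```python
# Toy model (my construction): a "consequentialist" planner that enumerates
# two-step plans and picks the one with the highest expected goal-score.
# With probability P_PRESS a human presses a shutdown button after step one;
# a shut-down agent never finishes its task and scores 0. So any plan that
# first disables the button strictly dominates, and the planner resists
# correction without that ever being an explicit goal.

from itertools import product

P_PRESS = 0.5        # assumed chance the human tries to shut the agent down
GOAL_REWARD = 10.0   # reward for finishing the task (at step two)

ACTIONS = ["disable_button", "leave_button", "work"]

def expected_score(plan):
    """Expected goal-reward of a two-step plan, marginalising over
    whether the human presses the (still-functional) button after step one."""
    first, second = plan
    button_live = first != "disable_button"
    p_shutdown = P_PRESS if button_live else 0.0
    finishes = second == "work"
    return (1 - p_shutdown) * (GOAL_REWARD if finishes else 0.0)

# Search paths through time; select the high-scoring one for output.
best = max(product(ACTIONS, repeat=2), key=expected_score)
print(best, expected_score(best))   # ('disable_button', 'work') 10.0
```

Corrigibility would mean the agent is indifferent between the pressed and unpressed branches, so that disabling the button buys it nothing; but naive patches to the score function tend to just move the problem around, which (as I understand it) is roughly what “anti-natural” is pointing at.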
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either be mistaken in thinking they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, at which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
(This is partly based on this summary)
> or honestly panic about not having achieved it and halt, at which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world
Note that this is still better than ‘honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don’t need to halt’!