To have human values, the AI needs to either learn them or have them instilled. EY's complexity/fragility of human values argument is directed against early proposals for learning human values into an AI's utility function. Obviously, at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant: it doesn't affect the AI's utility function, and it comes far too late. The AI needs a robust model of human values well before it becomes superhuman.
Katja's point is valid: DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which, again, is completely unrelated to the AI later learning human values somewhere in its world model).
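To make the "learn models of human values to steer the AI" direction concrete, here is a minimal, purely illustrative sketch of preference-based value learning: a small reward model trained on pairwise human comparisons with a Bradley-Terry style loss. The model, shapes, and toy data are all my own assumptions, not a description of any actual system.

```python
# Toy sketch: learning a scalar "value" model from pairwise human preference data.
# Everything here (architecture, dimensions, random data) is illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a fixed-size representation of an outcome to a scalar value score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style objective: push the preferred outcome's score
    # above the rejected outcome's score.
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for human preference data: pairs of outcome embeddings where the
# first element of each pair was judged better by a human rater.
preferred = torch.randn(256, 16)
rejected = torch.randn(256, 16)

for step in range(200):
    loss = preference_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned scorer could then be used as a training signal for another system, which is the steering role discussed above; whether such a model stays robust as the steered system grows more capable is exactly the open question.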
I agree Eliezer is wrong, though that's not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where the AI appears to learn the values very well, but only for instrumental-convergence reasons, and once it's strong enough, it overthrows the humans and pursues whatever terminal goal it actually has.
I agree that boxing is at least a first step, so that the AI doesn't get more compute or, worse, FOOM.
The tricky problem is that we need to be able to train away deception, or forbid it entirely, without merely obfuscating it so that it only looks trained away.
This is why we need to move beyond the black box paradigm, and why strong interpretability tools are necessary.
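As one concrete example of what "moving beyond the black box" can look like, here is a toy sketch of a linear probe: training a simple classifier on a model's internal activations to check whether they encode a property that the outputs alone would not reveal. The model, data, and probed property below are all illustrative assumptions.

```python
# Toy sketch of one interpretability technique: a linear probe on internal activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """A stand-in 'black box' whose hidden layer we want to inspect."""
    def __init__(self, dim_in: int = 8, dim_hidden: int = 32):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_hidden)
        self.head = nn.Linear(dim_hidden, 1)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # internal activations we will probe
        return self.head(h), h

model = ToyModel()
x = torch.randn(512, 8)
labels = (x[:, 0] > 0).float()            # hidden property: sign of the first input feature

with torch.no_grad():
    _, activations = model(x)             # read off the internal activations

# Linear probe: logistic regression from activations to the hidden property.
probe = nn.Linear(activations.shape[1], 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(300):
    loss = loss_fn(probe(activations).squeeze(-1), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = ((probe(activations).squeeze(-1) > 0).float() == labels).float().mean()
print(f"probe accuracy on the hidden property: {acc:.2f}")  # well above chance if the property is encoded
```

The point is only that we can ask questions of the internals rather than the outputs; detecting something like deceptive cognition this way is far harder, and is exactly what stronger interpretability tools would need to deliver.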
Sim boxing can solve deceptive alignment (and may be the only viable solution).