Notice that all those desiderata are much easier to satisfy when the AI knows our (extrapolated) preferences. It is not at all clear that they can be achieved otherwise.
It seems like, as long as she wanted to, a human Alice could satisfy these desiderata when helping Bob, even though Alice doesn’t know Bob’s extrapolated preferences? So I’m not sure why you think an intelligent AI couldn’t do the same.
Maybe you think that it’s because Alice and Bob are both humans? But I also think Alice could satisfy these desiderata when helping an alien from a different planet—she would definitely make some mistakes, but presumably not the existentially catastrophic variety*.
*unless the alien has some really unusual values where an existential catastrophe can be caused by accident, e.g. “if anyone ever utters the word $WORD, that is the worst possible universe”, but those sorts of values seem very structurally different than human values.
I actually don’t think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she’s just projecting human assumptions onto alien behaviour and statements.
She needs an alien theory of mind to understand what the alien wants
Absolutely, I would think that the first order of business would be to learn that alien theory of mind (and be very conservative until that’s done).
Maybe you’re saying that this alien theory of mind is unlearnable, even for a very intelligent Alice? That seems pretty surprising, and I don’t feel the force of that intuition (despite the Occam’s razor impossibility result).
Developing this idea a bit: https://www.lesswrong.com/posts/kMJxwCZ4mc9w4ezbs/how-an-alien-theory-of-mind-might-be-unlearnable