To have human values, the AI needs to either learn them or have them instilled. EY's complexity/fragility of human values argument is directed against early proposals for learning human values into an AI's utility function. Obviously, at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant: it doesn't affect the AI's utility function, and it comes far too late. The AI needs a robust model of human values well before it becomes superhuman.
Katja's point is valid: DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which, again, is completely unrelated to the AI later learning human values somewhere in its world model).
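To make the "learn models of human values to steer the AI" direction concrete, here is a minimal, purely illustrative sketch of preference-based value learning: a small reward model trained on pairwise human comparisons with a Bradley-Terry style loss. The model, shapes, and toy data are all my own assumptions, not a description of any actual system.

```python
# Toy sketch: learning a scalar "value" model from pairwise human preference data.
# Everything here (architecture, dimensions, random data) is illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a fixed-size representation of an outcome to a scalar value score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style objective: push the preferred outcome's score
    # above the rejected outcome's score.
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for human preference data: pairs of outcome embeddings where the
# first element of each pair was judged better by a human rater.
preferred = torch.randn(256, 16)
rejected = torch.randn(256, 16)

for step in range(200):
    loss = preference_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned scorer could then be used as a training signal for another system, which is the steering role discussed above; whether such a model stays robust as the steered system grows more capable is exactly the open question.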
I agree Eliezer is wrong, though that's not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where the AI appears to learn the values very well, but only for instrumental-convergence reasons, and once it's strong enough, it overthrows the humans and pursues whatever terminal goal it actually has.
I agree that boxing is at least a first step, so that the AI doesn't get more compute or, worse, FOOM.
The tricky problem is that we need to be able to train away deception, or forbid it entirely, without merely obfuscating it so that it only looks trained away.
This is why we need to move beyond the black box paradigm, and why strong interpretability tools are necessary.
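As one concrete example of what "moving beyond the black box" can look like, here is a toy sketch of a linear probe: training a simple classifier on a model's internal activations to check whether they encode a property that the outputs alone would not reveal. The model, data, and probed property below are all illustrative assumptions.

```python
# Toy sketch of one interpretability technique: a linear probe on internal activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """A stand-in 'black box' whose hidden layer we want to inspect."""
    def __init__(self, dim_in: int = 8, dim_hidden: int = 32):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_hidden)
        self.head = nn.Linear(dim_hidden, 1)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # internal activations we will probe
        return self.head(h), h

model = ToyModel()
x = torch.randn(512, 8)
labels = (x[:, 0] > 0).float()            # hidden property: sign of the first input feature

with torch.no_grad():
    _, activations = model(x)             # read off the internal activations

# Linear probe: logistic regression from activations to the hidden property.
probe = nn.Linear(activations.shape[1], 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(300):
    loss = loss_fn(probe(activations).squeeze(-1), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = ((probe(activations).squeeze(-1) > 0).float() == labels).float().mean()
print(f"probe accuracy on the hidden property: {acc:.2f}")  # well above chance if the property is encoded
```

The point is only that we can ask questions of the internals rather than the outputs; detecting something like deceptive cognition this way is far harder, and is exactly what stronger interpretability tools would need to deliver.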
Sim boxing can solve deceptive alignment (and may be the only viable solution).