As some other people have answered, I think one fault line is whether an AI is misaligned through inappropriate generalization of what the human wanted or of the reward function (generally categorized as outer misalignment), or whether it is deceptively aligned and essentially has an arbitrary value system, subject to the constraint of simplicity (generally categorized as inner misalignment).
I think the key difference between the views of Nate, Eliezer and co and those of Paul and co on whether AIs are a little nice to humans stems from this factor:
Nate, Eliezer and co often tend to view AIs as deceptively misaligned by default, or at least view them as having values that are unrelated to human values, which imposes far fewer constraints on their values and makes it less likely that AI systems care about human values at all.
Paul and co tend to think that misalignment isn’t overwhelmingly likely, but that conditional on misalignment, it will look more like inappropriate generalization of reward functions/human values, so AIs still retain some care for human values. Depending on how much they misgeneralize, this might be enough to get AGI and ASI that care enough about us that we get some reasonably good outcomes.
In retrospect, I am more pessimistic that small amounts of niceness in AI would let humans live, and I now think that some amount of alignment stronger than pseudokindness is necessary for humans to survive with AI (though maybe not as strong as MIRI thinks), essentially because niceness to humans requires giving up opportunities to save compute on modeling the world, which AI companies are incentivized against:
https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly#wy9cSASwJCu7bjM6H
Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the necessary amount of niceness to make humans survive with AGI?
My answer is basically yes.
I was only addressing the question “If we basically failed at alignment, or didn’t align the AI at all, but had a very small amount of niceness, would that lead to good outcomes?”