In retrospect, I am more pessimistic about small amounts of niceness in an AI being enough to keep humans alive, and I now think that some amount of alignment stronger than pseudokindness is necessary for humans to survive alongside AI (though maybe not as strong as MIRI thinks), essentially because niceness to humans requires giving up opportunities to save compute on modeling the world, which AI companies anti-incentivize:
https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly#wy9cSASwJCu7bjM6H
Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the amount of niceness necessary for humans to survive with AGI?
My answer is basically yes.
In that earlier comment, I was only addressing the question “If we basically failed at alignment, or didn’t align the AI at all, but had a very small amount of niceness, would that lead to good outcomes?”