Noosphere89 answers What are the best arguments for/against AIs being “slightly ‘nice’”?

Noosphere89 25 Sep 2024 17:17 UTC
6 points
2
As some other people have answered, I think a fault line is whether an AI is misaligned through inappropriate generalization of what the human wanted or of the reward function (which is generally categorized as outer misalignment), or whether it is deceptively aligned and essentially has an arbitrary value system, subject to the constraint of simplicity (which is generally categorized as inner misalignment).

I think the key difference between the views of Nate, Eliezer and co and those of Paul and co on the question of whether AIs are a little nice to humans stems from this factor:

Nate, Eliezer and co tend to view AIs as deceptively misaligned by default, or at least view them as having values that are unrelated to human values. This imposes far fewer constraints on their values and makes it less likely that AI systems care about human values at all.

Paul and co tend to think that misalignment isn’t overwhelmingly likely, but that conditional on misalignment, it will look more like inappropriate generalization of reward functions/human values, so AIs still retain some care for human values. Depending on how much the AI misgeneralized, this might be enough to get AGI and ASI that care enough about us that we get some reasonably good outcomes.
- Noosphere89 25 Nov 2024 19:44 UTC
  3 points
  1
  In retrospect, I am more pessimistic that small amounts of niceness in AI would be enough for humans to live, and I now think that some alignment stronger than pseudokindness is necessary for humans to survive with AI (though maybe not as strong as MIRI thinks), essentially because niceness to humans requires giving up opportunities to save compute on modeling the world, which the incentives of AI companies push against:
  
  https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly#wy9cSASwJCu7bjM6H
  - Dakara 3 Dec 2024 10:12 UTC
    1 point
    0
    Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the necessary amount of niceness to make humans survive with AGI?
    - Noosphere89 3 Dec 2024 16:13 UTC
      4 points
      2
      My answer is basically yes.
      
      I was only addressing the question “If we basically failed at alignment, or didn’t align the AI at all, but had a very small amount of niceness, would that lead to good outcomes?”
