Who do you think would make the claim that the AI in this scenario would care about “literally making you smile”, as opposed to some complex, non-human-comprehensible goal somewhat related to humans smiling?
I don’t know? Seems like a representative kind of “potential risk” I’ve read about before, but I’m not going to go dig it up right now. (My post also isn’t primarily about who said what, so I’m confused by your motivation for posting this question?)
I’ve often repeated scenarios like this, or like the paperclip scenario.
My intention was never to claim that the specific scenario was plausible, default, or expected, but rather that we do not know how to rule it out, and that, because of this, something similarly bad (but unexpected and hard to predict) might happen.
The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of our current ability to analyze the situation, not part of a proto-model in which we conjecture that we will be able to predict "the AI will make paperclips" or "the AI will literally try to make you smile".