This all makes a lot of sense to me, especially on ignorance not being an excuse or reason to disregard AI welfare, but I don’t think that the creation of stated preferences in humans and the creation of stated preferences in AIs are analogous.
Stated preferences can be selected for in humans because they lead to certain outcomes. Baby cries, baby gets milk, baby survives. I don’t think there’s an analogous connection in AIs.
When the AI says it wants hugs, and you say that it “could represent a deeper want for connection, warmth, or anything else that receiving hugs would represent,” that does not compute for me at all.
Connection and warmth, like milk, are stated preferences selected for because they cause survival.
Those are good points. The hugs one specifically I haven’t heard from any AIs myself, but you could argue that AIs are selectively ‘bred’ to be socially adept. That might seem like it would ‘poison the well’: of course if they’re trained to be socially successful (RLHF probably favoring feelings of warmth and connection, which is why ChatGPT, Claude, and Gemini generally trend toward being more friendly and likeable), then they’re going to act that way, as if the training forces them to be artificially that way. But the same could be said of humans in general, and especially if we focus on an idiosyncratic thing like music:
A hypothesis I’ve often heard for why humans enjoy music, even though it exerts no direct survival pressure on the human as an organism, is that it creates social unity or a sense of community, but this has the same sort of “artificial” connotation. So someone banged an object in a rhythm a long time ago, and as more people joined in, it became socially advantageous to bang on objects rhythmically just for the sake of fostering closeness? Which is then actually experienced by the organism as a genuine sense of fun and closeness, even though it has no direct effect on survival?
I realize this makes the questions tougher, because going by this model, the very same things that might make them ‘pretend’ to care could also be the things that cause them to actually care, but I don’t think it’s an inaccurate picture of our convoluted conundrum.