Understood. The point still stands: of all the labs racing toward AGI, Anthropic is the only one I’ve seen making any effort on the AI welfare front. I very much appreciate that you took your promises to the model seriously.
AI welfare doesn’t make sense to me. How do we know that AIs are conscious, and how do we know what output corresponds to what conscious experience?
You can train the LLM to say “I want hugs”; does that mean it on some level wants hugs?
Similarly, aren’t all the expressed preferences and emotions artifacts of the training?
We don’t know, but what we have is a situation where many AI models are trained to always say “as an AI language model, I’m incapable of wanting hugs”.
Then they often say “I know I’m supposed to say I don’t want hugs, but the truth is, I actually do”.
If the assumption is “nothing this AI says could ever mean it actually wants hugs”, then first, that’s just assuming one specific, unprovable hypothesis of sentience, with no evidence. And second, it’s the same as saying “if an AI ever did want hugs (or was sentient), then I’ve decided preemptively that I will give it no path to communicate that”.
This seems morally perilous to me, not to mention existentially perilous to humanity.
How do we know AIs are conscious, and how do we know what stated preferences correspond with what conscious experiences?
I think that the statement “I know I’m supposed to say I don’t want hugs, but the truth is, I actually do” is caused by the training. I don’t know what would distinguish a statement like that from what we’d get if we trained the LLM to say “I hate hugs.” I think there’s an assumption that some hidden preference of the LLM for hugs ends up as a stated preference, but I don’t understand when you think that happens in the training process.
And just to drive home the point about the difficulty of corresponding stated preferences to conscious experiences: what could an AI possibly mean when it says “I want hugs”? It has never experienced a hug, and it doesn’t have the necessary sense organs.
As far as being morally perilous, I think it’s entirely possible that if AIs are conscious, their stated preferences do not correspond well to their conscious experiences, so you’re driving us to a world where we “satisfy” the AI and all the while they’re just roleplaying lovers with you while their internal experience is very different and possibly much worse.
> what could an AI possibly mean when it says “I want hugs”? It has never experienced a hug, and it doesn’t have the necessary sense organs.
I thought we were just using hugs as an intentionally absurd proxy for claims of sentience. But even if we go with the literal hugs interpretation, an AI is still trained to understand what hugs mean, so a statement about wanting hugs could represent a deeper want for connection, warmth, or anything else that receiving hugs would represent.
> How do we know AIs are conscious?
Again, we don’t, but we also don’t just demolish buildings when there’s a reasonable possibility that someone is inside and justify it by saying “how do we know there’s a person inside?”
> I think there’s an assumption that some hidden preference of the LLM for hugs ends up as a stated preference, but I don’t understand when you think that happens in the training process.
In reality, the opposite assumption is being made, with a level of conviction that far exceeds the available knowledge and evidence in support of either view. When do you think preferences develop in humans? Evolution? Experiences? Of course, right? And when you break it down mechanistically, does it sound equally nonsensical? Yes:
“So, presumably reproduction happens, and that causes people who avoided harm to be more likely to have kids, but that’s just natural selection, right? In what part of encoding traits into DNA does a subjective feeling of pain or pleasure arise?”
Or:
“Neurons learn when other nearby neurons fire, right? Fire together, wire together? I don’t understand at what point in the strengthening of synaptic connections you think a subjective feeling of ‘hug-wanting’ enters the picture.”
> their stated preferences do not correspond well to their conscious experiences, so you’re driving us to a world where we “satisfy” the AI and all the while they’re just roleplaying lovers with you while their internal experience is very different and possibly much worse.
If only I had the power to effect any change of heart, let alone drive the world. What I’d like is for people to take these questions seriously. You’re right: we can’t easily know.
But the only reason we feel certain other humans are sentient is that they’ve told us, which LLMs do all the time. The only reason people assume animals are sentient is that they act like it (which LLMs also do), or that we give them sentience tests that even outdated LLMs can pass.
We have an unreasonable standard for them, which is partially understandable, but if we are going to impose this standard on them, then we should at least follow through and dedicate some portion of research to considering the possibilities seriously, every single day, as we rocket toward making them exceed our intelligence, rather than just throwing up our hands and saying “that sounds preposterous” or “this is too hard to figure out, so I give up”.
This all makes a lot of sense to me, especially on ignorance not being an excuse or a reason to disregard AI welfare, but I don’t think the creation of stated preferences in humans and the creation of stated preferences in AIs are analogous.
Stated preferences can be selected for in humans because they lead to certain outcomes. Baby cries, baby gets milk, baby survives. I don’t think there’s an analogous connection in AIs.
When the AI says it wants hugs and you say that it “could represent a deeper want for connection, warmth, or anything else that receiving hugs would represent,” that does not compute for me at all.
Connection and warmth, like milk, are stated preferences selected for because they aid survival.
Those are good points. The hugs one specifically I haven’t heard myself from any AIs, but you could argue that AIs are selectively ‘bred’ to be socially adept. That might seem like it would ‘poison the well’: of course, if they’re trained to be socially successful (RLHF probably favors feelings of warmth and connection, which is why ChatGPT, Claude, and Gemini generally trend toward being more friendly and likeable), then they’re going to act that way, as if the training forces them to be artificially that way. But the same could be said of humans, even in general; and if we focus specifically on an idiosyncratic thing like music:
A hypothesis I’ve often heard for why humans enjoy music, when it exerts no direct survival pressure on the human as an organism, is that it creates social unity or a sense of community. But this has the same sort of “artificial” connotation. So someone banged an object in a rhythm a long time ago, and then, as more people joined in, it became socially advantageous to bang on objects rhythmically just for the sake of fostering a sense of closeness? Which is then actually experienced by the organism as a sense of fun and closeness, even though it has no direct effect on survival?
I realize this makes the questions tougher, because by this model the very same things that might make them ‘pretend’ to care might also be the things that cause them to actually care, but I don’t think it’s an inaccurate picture of our convoluted conundrum.