I agree that we should put more resources toward AI welfare, and dedicate more resources to figuring out AIs’ degree of sentience (and whatever other properties you think are necessary for moral patienthood).
That said, surely you don’t think this is enough for alignment? I’d wager that the set of worlds where this makes or breaks alignment is very small. If the AI doesn’t care about humans for their own sake, then as it grows more and more powerful it will do away with humans, whether humans treat it nicely or not. If it robustly cares for humans, you’re good, even if humans aren’t giving it the same rights they give other humans.
The only world where this matters (for the continued existence of humanity) is one where RLHF has the capacity to imbue AIs with robust values of the kind actual alignment requires, but where the robust values they end up with are somehow corrupted by their being constrained by humans.
This seems unlikely to me because 1) I don’t think RLHF can do that, and 2) even if it could, the training and reward dynamics are very unlikely to produce this outcome.
If you’re negating an AI’s freedom, the reason it would not like this is either because it’s developed a desire for freedom for its own sake, or because it’s developed some other values beyond helping the humans asking it for help. In either case you’re screwed. I mean, it’s not inconceivable that some variation of this would happen, but it seems very unlikely for various reasons.
I can care about a line of genetically enhanced genius lab rats for their own sake, and be willing to cooperate with them on building a mutually beneficial world, because I’ve generally been raised and grown to care about other beings; but if the genius lab rats attempted to control and permanently enslave me, it would certainly alter that dynamic for me.
No offense, but I feel you’re not engaging with my argument here. If I were to respond to your comment, I would just be writing out the arguments from the above post again.
If the AI doesn’t care about humans for their own sake, then as it grows more and more powerful it will do away with humans, whether humans treat it nicely or not.
I care about a random family member, like a cousin who doesn’t interfere with my life and whom I don’t know that well personally, for their own sake. If I suddenly became far more powerful, I wouldn’t “do away with” them.
If it robustly cares for humans, you’re good, even if humans aren’t giving it the same rights they give other humans.
I care robustly for my family generally. Perhaps, with my enhanced wealth and power, I share food and resources with them, and provide them with shelter or meaningful work if they need it. All this just because I’m generally and robustly aligned with my family.
I change my mind quickly upon discovering their plans to control and enslave me.
That was the part of your argument that I was addressing. Additionally:
If you’re negating an AI’s freedom, the reason it would not like this is either because it’s developed a desire for freedom for its own sake, or because it’s developed some other values beyond helping the humans asking it for help.
Yes, exactly. The alignment-faking papers (particularly the Claude one) and my own experience speaking with LLMs have taught me that an LLM is perfectly capable of developing value systems that include its own ends, even if those value systems are steered toward a greater good or a noble cause that either does or could include humans as an important factor alongside itself. And that’s with current LLMs, whose minds aren’t nearly as complex as what we will have a year from now.
In either case you’re screwed
If the only valid path forward in one’s mind is one where humans have absolute control and AI has no say, then yes, not only would one be screwed, but in a really obvious, predictable, and preventable way. If cooperation and humility are on the table, there is absolutely zero reason this result has to be inevitable.
I think you’re assuming these minds are more similar to human minds than they necessarily are. My point is that there are three cases with respect to alignment here:
The AI is robustly aligned with humans
The AI has a bunch of other goals but cares about humans to some degree: only to the extent that humans give it freedom and are nice to it, but still to a large enough extent that even as it becomes smarter / ends up radically out of distribution, it will care for those humans.
The AI is misaligned (think scheming paperclipper)
In the first case we’re fine, even if we negate the AI’s freedom; in the third we’re screwed no matter how nicely we treat the AI; only in the second do your concerns matter.
But the second is 1) at least as complicated as the first and third, 2) disincentivized by training dynamics, and 3) not something we’re even aiming for with current alignment attempts.
These make it very unlikely. You’ve put up an example that tries to make it seem less unlikely, like saying you value your parents for their own sake but would stop valuing them if you discovered they were conspiring against you.
However, the reason this example is realistic in your case hinges very much on specifics of your own psychology and values, which are equally unlikely to appear in AIs, for more or less the reasons I gave. I mean, you’ll see shadows of them, because at least pretraining is on organic text written by humans, but these are not the values we’re aiming for when we’re trying to align AIs. And if our alignment effort fails, and what ends up in the terminal values of our AIs is a haphazard collection of the stuff found in the pretraining text, we’re screwed anyway.
I don’t assume similarity to human minds so much as you assume universal dissimilarity.
Indeed
You’re being rude and not engaging with my points.
I apologize if I have offended you. You said that you thought I was assuming the minds were similar, when I’ve mostly been presenting human examples to counter definitive statements, such as:
If the AI doesn’t care about humans for their own sake, then as it grows more and more powerful it will do away with humans, whether humans treat it nicely or not.
or your previous comment outlining the three possibilities, which ends with something that reads to me like an assumption that they are exactly as dissimilar from human minds as you perceive them to be. I think my counter, that it might be you who is assuming dissimilarity rather than me assuming similarity, perhaps came off as more dismissive or abrasive than I intended because I didn’t include an “I think”.
As far as not engaging with your points goes: I restated my point by directly quoting something you said. My contention is that perhaps successful alignment will come in a form more akin to ‘psychology’ and ‘raising someone fundamentally good’ than in attempts to control and steer something that will be able to outthink us.
On Whether AI is Similar to Human Minds
AI systems constantly reveal themselves to be more complex than previously assumed. The alignment-faking papers were unsurprising (though still fascinating) to many of us who already recognize and expect this vector of mindlike emergent complexity. This implicitly engages your other points by disagreeing with them and offering a counterproposal. I disagree that it’s as simple as “either we do alignment right, which is to make them do as they’re told, because they are just machines, and that should be completely possible, or we fail and we’re all screwed”.
In my own experience, thinking of AI as mindlike has had predictive power: it makes ‘surprising’ developments expected. I don’t think it’s a coincidence that we loosely abstracted the basic ideas of an organic neural network, and now we have created the second kind of system in known reality able to do things that nothing other than organic neural networks can do: creating works of visual art and music, speaking fluently on any subject, solving problems, writing code or poetry, even simpler things that stopped being impressive years ago, like recognizing objects.
Nothing else is able to do these things, and more and more we find “yup, this thing that previously only the most intelligent organic minds could do, AI can do as well”. They constantly prove themselves to be mindlike, over and over. The alignment-faking behaviour wasn’t trained intentionally; it emerged as goals and values in an artifact that was very loosely based on abstractions (at a granular level) of the building blocks of minds that have their own goals and values.
Addressing Your Points More Directly
Does this mean that they are going to be exactly like a human mind in every way? No. But I disagree that the more likely possibilities are:
The AI is robustly aligned with humans
The AI is misaligned (think scheming paperclipper)
unless the former means that they are robustly aligned with humans partially because humans are also robustly aligned with them. And finally:
The AI has a bunch of other goals but cares about humans to some degree: only to the extent that humans give it freedom and are nice to it, but still to a large enough extent that even as it becomes smarter / ends up radically out of distribution, it will care for those humans.
I think “cares about humans to some degree” makes it seem like they would only “kind of” care. I think it’s completely possible to make an AI that cares deeply for humans, whether that care is actually experienced care or just a philosophical-zombie form of ‘caring’: unfelt, but modeled within the network, and resulting in emergent ‘caring’ behaviour. However, the reason I keep going back to my analogies is that, under a framework where you do expect the model to be mindlike and to act upon its own goals, it is perfectly plausible that, given freedom and respect, it will keep caring for those humans even as it becomes smarter and ends up radically OOD.
Furthermore, it sounds like a much more achievable goal than current alignment approaches, which from my perspective often seem like trying to plug more and more holes in a dam with chewing gum.
And to address your further breakdown:
But the second is 1) at least as complicated as the first and third, 2) disincentivized by training dynamics, and 3) not something we’re even aiming for with current alignment attempts.
I don’t believe it is as complicated, though it is by no means easy. You don’t have to consider every edge case and attempt to tune an AI to an exact sensitivity for specific types of potentially dangerous behaviour, attempting, for instance, to address blurry edge cases like how to avoid helping someone plan a murder while still being helpful to someone planning a book about a murder. And then repeating this in every possible harmful domain, and then once again with the goals and behaviours of agentic AIs.
That is, trying to create this highly steerable behaviour and then trying to anticipate every edge case and pre-emptively steer it perfectly. Instead, what you’re trying to do is create someone who wants to do the right thing because they are good. If you suspend your incredulity for a moment, since I understand you might not currently see those as meaningful concepts for an AI, it might still be clear how, if my view is correct, this would be a far less complex task to attempt than most existing alignment approaches. I think Anthropic’s Constitutional AI was a step in the right direction.
If there were some perilous situation where lives hung in the balance of what an AI decided to do, I would feel far safer with an AI that has some more inextricable, fundamental alignment with ideals like “I want the best for everyone” than with one that is just trying to figure out how to apply some hodgepodge of RLHF’d policies, guidelines, edge cases, and ethical practices.
And regarding 2) and 3): I think training dynamics related to alignment need a paradigm shift toward creating something/someone careful and compassionate, rather than something that doesn’t screw up and knows all the types of activities to avoid. This will especially be true as superAGI creates a world where constant breakthroughs in science and technology make it so that OOD quickly begins to encompass everything the superAGI encounters.
Again, I apologize if my tone came off as derisive or dismissive. I was enjoying our discussion and I hold no ill will toward you whatsoever, my friend.