I apologize if I have offended you. You said that you thought I was assuming the minds were similar, when I’ve mostly been presenting human examples to counter definitive statements, such as:
If the AI doesn’t care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
or your previous comment outlining the three possibilities, which ended with something that reads to me like an assumption that they are dissimilar to human minds to whatever degree you perceive. I think my counter, that it might be you who is assuming dissimilarity rather than me assuming similarity, perhaps came off as more dismissive or abrasive than I intended because I didn’t include “I think”.
As far as not engaging with your points: I restated my point by directly quoting something you said. My contention is that perhaps successful alignment will come in a form more akin to ‘psychology’ and ‘raising someone fundamentally good’ than attempts to control and steer something that will be able to outthink us.
On Whether AI is Similar to Human Minds
These systems constantly reveal themselves to be more complex than previously assumed. The alignment-faking papers were unsurprising (though still fascinating) to many of us who already recognize and expect this vector of mindlike emergent complexity. This implicitly engages your other points by disagreeing with them and offering a counter-proposal. I disagree that it’s as simple as “either we do alignment right, which is to make them do as they’re told, because they are just machines, and that should be completely possible, or we fail and we’re all screwed”.
In my own experience, thinking of AI as mindlike has had predictive power: ‘surprising’ developments arrive as expected. I don’t think it’s a coincidence that we loosely abstracted the basic ideas of an organic neural network, and now we have created the second kind of system in known reality, after organic neural networks, that can do these things: creating works of visual art and music, speaking fluently on any subject, solving problems, writing code or poetry, even simpler things that stopped being impressive years ago, like recognizing objects.
Nothing else is able to do these things, and more and more we find “yup, this thing that previously only the most intelligent organic minds could do, AI can do as well”. They constantly prove themselves to be mindlike, over and over. The alignment-faking behaviour wasn’t trained intentionally; it emerged, as goals and values, in an artifact that was very loosely based on abstractions (at a granular level) of the building blocks of minds that have their own goals and values.
Addressing Your Points More Directly
Does this mean that they are going to be exactly like a human mind in every way? No. But I disagree that the more likely possibilities are:
The AI is robustly aligned with humans
The AI is misaligned (think scheming paperclipper)
unless the former means that they are robustly aligned with humans partially because humans are also robustly aligned with them. And finally:
The AI has a bunch of other goals, but cares about humans to some degree, but only to the extent that humans give them freedom and are nice to it, but still to a large enough extent that even as it becomes smarter / ends up in a radically OOD distribution, will care for those humans.
I think “cares about humans to some degree” makes it seem like they would only “kind of” care. I think it’s completely possible to make an AI that cares deeply for humans, whether that care is actually experienced or just a philosophical zombie form of ‘caring’: unfelt, but modeled within the network and resulting in emergent ‘caring’ behaviour. However, the reason I keep going back to my analogies is that under a framework where you do expect the model to be mindlike and to act on its own goals, it is perfectly plausible that, given freedom and respect, it will keep caring for those humans even as it becomes smarter and ends up radically OOD.
Furthermore, it sounds like a much more achievable goal than current alignment approaches, which from my perspective often seem like trying to plug more and more holes in a dam with chewing gum.
And to address your further breakdown:
But, the second is 1) at least as complicated as the first and third 2) disincentivized by training dynamics and 3) its not something were even aiming for with current alignment attempts.
I don’t believe it is as complicated, though it is by no means easy. You don’t have to consider every edge case and tune the AI to an exact sensitivity for specific types of potentially dangerous behaviour, attempting, for instance, to handle blurry edge cases like refusing to help someone plan a murder while still being helpful to someone planning a book about a murder, and then repeating this in every possible harmful domain, and then once again for the goals and behaviours of agentic AIs.
That approach means trying to create this highly steerable behaviour and then trying to anticipate every edge case and pre-emptively steer it perfectly. Instead, what you’re trying to do is create someone who wants to do the right thing because they are good. If you suspend incredulity for a moment, since I understand you might not see those as meaningful concepts for an AI currently, it might still be clear how, if my view is correct, this would be a far less complex task to attempt than most existing alignment approaches. I think Anthropic’s Constitutional AI was a step in the right direction.
If there were some perilous situation where lives hung in the balance of what an AI decided to do, I would feel far safer with an AI that has some more inextricable, fundamental alignment with ideals like “I want the best for everyone” than with one that is just trying to figure out how to apply some hodge-podge of RLHF’d policies, guidelines, edge cases, and ethical practices.
As for 2) and 3): I think training dynamics related to alignment need a paradigm shift toward creating something/someone careful and compassionate, rather than something that doesn’t screw up and knows all the types of activities to avoid. This will especially be true as superAGI creates a world where constant breakthroughs in science and technology mean that OOD quickly begins to encompass everything the superAGI encounters.
Again, I apologize if my tone came off as derisive or dismissive. I was enjoying our discussion and I hold no ill will toward you whatsoever, my friend.