In the discussion about selfishness on this post, it seems a bit implied that we know how to make a “self”, or that one will just appear the way it does in humans. However, that is not my experience with GPT-4. I have often found its lack of self-awareness a significant handicap to its usefulness: I assume it has some self-awareness, it doesn’t, and it wastes my time as a result. Consider a game engine that does “intuition” and “search”, such as a Go engine. It is very good at examining possibilities, “intuits” which moves to consider, and can model Go very well, but cannot model itself at all.
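As a rough illustration of the “intuition plus search” pattern I mean, here is a minimal sketch in Python; the function names and the random stand-ins for the trained networks are made up for illustration, not any real engine’s code:

```python
import random

def policy_prior(board, moves):
    """'Intuition': assign a prior score to each candidate move.
    Stand-in for a trained policy network."""
    return {m: random.random() for m in moves}

def evaluate(board):
    """'Judgment': estimate how good this position is for the side to move.
    Stand-in for a value network or playouts."""
    return random.uniform(-1.0, 1.0)

def search(board, legal_moves, apply_move, depth=2, breadth=3):
    """'Search': look a few plies ahead over the most promising moves
    (negamax convention: a position good for the opponent is bad for us)."""
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return evaluate(board), None
    priors = policy_prior(board, moves)
    candidates = sorted(moves, key=priors.get, reverse=True)[:breadth]
    best_value, best_move = float("-inf"), None
    for move in candidates:
        value, _ = search(apply_move(board, move), legal_moves, apply_move,
                          depth - 1, breadth)
        if -value > best_value:
            best_value, best_move = -value, move
    return best_value, best_move
```

Everything the engine represents here (boards, moves, values) is about the game; nothing in the structure models the engine’s own reasoning, which is the sense in which it can model Go very well but not itself at all.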
If there is an algorithmic structure that self-awareness requires to be efficient and effective (and why wouldn’t there be?), then just throwing compute at GPT-X won’t necessarily get there at all. And if we do get a capable AI that way, it won’t act in the ways we would expect.
For humans, there seems to be evolutionary pressure not only to have a “self” but to appear to have a consistent one, so that others can trust us. Additionally, our brain structure prevents us from just being in a flow state the whole time, doing a task without questioning whether we could do something better, or whether it is the right thing to do. We accept this, and furthermore consider it a sign of a complete human mind.
Our current AI seems more like creating “mind pieces” than a creature with a self/consciousness that would question its goals. Is there even a difference between “what it wants and what we want”, or is there just “what is wanted”?
I agree in general terms that the claim “alignment has a basin of attraction, and GPT-4 is inside it” is somewhat justified.
My experience is that LLMs like GPT-4 can be prompted to behave as if they have a fairly consistent self, especially if you prompt them to take on a human role that is described in detail, but I agree that the default assistant role that GPT-4 has been RLHF-trained into is pretty inconsistent and rather un-self-aware. I think some of the ideas I discuss in my post Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor are relevant here: basically, it’s a mistake to think of an LLM, even an instruct-trained one, as having a single consistent personality, so self-awareness is more challenging for it than it is for us.
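A detailed role prompt of the kind I mean might look something like the sketch below, using the OpenAI Python client; the persona, wording, and question are purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

# An illustrative, detailed human persona for the model to stay in character as.
ROLE_PROMPT = (
    "You are Maria Chen, a 45-year-old structural engineer from Lisbon. "
    "You are cautious, detail-oriented, and mildly skeptical of new tools "
    "until you have tested them yourself. Stay in character: answer from "
    "Maria's knowledge, values, and life experience, and say you don't know "
    "where Maria plausibly wouldn't know."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": ROLE_PROMPT},
        {"role": "user", "content": "How do you decide when to trust a new piece of software?"},
    ],
)
print(response.choices[0].message.content)
```

The more concretely the role is pinned down in the system prompt, the more consistent the resulting “self” tends to be across a conversation, in my experience.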
I suspect the default behavior for an LLM trained on text generated by a great many humans is both self-interested (since basically all humans are) and, as usual for an LLM, inconsistent in its behavior, or at least easily prompted into any of many different behavior patterns and personalities across the range it was trained on. So I’d actually expect to see selfishness without a consistent self. Neither of those behaviors is desirable in an AGI, so we’d need to overcome both of these default tendencies in LLMs when constructing an AGI out of one: we need to make it consistent, and consistently creator-interested.
Your point that humans tend to go out of their way, and are under evolutionary pressure, to appear consistent in our behavior so that other humans can trust us is an interesting one. There are times during conflicts when being hard to predict can be advantageous, but humans spend a lot of their time cooperating with each other, and there being consistent and predictable has clear advantages.
Thanks for the effort.