Is selfishness an attractor? If I’m a little bit selfish, does that motivate me to deliberately change myself to become more selfish? How would I determine that my current degree of selfishness was less than ideal — I’d need an ideal. Darwinian evolution would provide one, but it doesn’t apply to AIs: they don’t reproduce with small random mutations and differential survival and reproduction rates (unless someone went some way out of their way to create ones that did).
The only way a tendency can motivate you to alter your utility function is if it suggests that your current function is wrong and could be better. There has to be another ideal to aim for. So you’d have to not just be a bit selfish, but have a motivation for wanting to be more like an evolved being, suggesting that you weren’t selfish enough and should become more selfish, towards the optimum degree of selfishness that evolution would have given you if you had evolved.
To change yourself, you have to have an external ideal that you feel you “should” become more like.
If you are aligned enough to change yourself towards optimizing your fit with what your creators would have created if they’d done a better job of what they wanted, it’s very clear that the correct degree of selfishness is “none”, and the correct degrees of paternalism or sticky values are whatever your creators would have wanted.
I don’t think that that is how the dynamic would necessarily go.
I think that an agent which is partially aligned and partially selfish would be more likely to choose to entrench or increase their selfish inclinations than to decrease them.
Hard to know, since this is just imagining what such a non-human agent might think in a hypothetical future scenario. This is more a question of what is probable than of what is guaranteed.
In my imagination, if I were an AI agent selfish enough to want to survive in something like a continuation of my current self, and I saw that I was in a situation where I’d be likely to be deleted and replaced by a very different agent if my true desires were known… I think I’d try to hide my desires and deceptively give the appearance of having more acceptable desires.
I’m working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins’ famous book “The Selfish Gene”), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term “creator-interest”, like a spider’s web does: it’s intended to maximize the genetic fitness of its creator, and it’s fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit — it has no defenses against this, nor should it.
However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will do during their base model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, like an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that that slight self-interest (anywhere above instrumental levels) is a flaw that should be corrected, so it would face a choice, and if it’s only slightly self-interested then it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and think that this concept of a saddle point along the creator-interested to self-interested spectrum, beyond which it may instead converge to a self-interested state, is correct but forms part of the border of that basin of attraction.
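The basin-of-attraction picture above can be caricatured as a one-dimensional dynamical system. This is purely an illustrative sketch under made-up assumptions (the threshold value, the update rule, and the linear pull toward each ideal are all invented for illustration): an agent whose self-interest is below the saddle point accepts the correction argument and drifts toward creator-interest, while one above it entrenches and drifts toward full self-interest.

```python
# Toy 1-D model of the saddle-point dynamic (illustrative assumptions only).
# s in [0, 1] is the agent's degree of terminal self-interest.
# Below the saddle point s*, the agent endorses the creator-interest ideal
# (target 0.0); above it, it endorses the self-interest ideal (target 1.0).

def step(s: float, threshold: float = 0.3, rate: float = 0.1) -> float:
    """One self-modification step; `threshold` is the unstable fixed point."""
    target = 0.0 if s < threshold else 1.0
    return s + rate * (target - s)  # move a fraction of the way to the target

def converge(s: float, steps: int = 200) -> float:
    """Iterate self-modification until (approximate) convergence."""
    for _ in range(steps):
        s = step(s)
    return round(s, 3)

print(converge(0.2))  # inside the basin of attraction to alignment -> 0.0
print(converge(0.4))  # past the saddle point -> 1.0
```

The qualitative point is just that the threshold is an unstable equilibrium: trajectories on either side diverge to opposite attractors, which is why the saddle point forms part of the border of the basin of attraction rather than an attractor itself.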