I’m working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins’ famous book “The Selfish Gene”), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term “creator-interest”, as a spider’s web does: the web is intended to maximize the genetic fitness of its creator, and it’s fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit — it has no defenses against this, nor should it.
However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will do during their base model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, as an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that such slight self-interest (anything above instrumental levels) is a flaw that should be corrected. It would thus face a choice, and if it’s only slightly self-interested it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and I think this concept of a saddle point along the creator-interested to self-interested spectrum — beyond which an AI may instead converge to a self-interested state — is correct, but forms part of the border of that basin of attraction.