The AI alignment problem as a consequence of the recursive nature of plans

Ruby’s recent post on how Plans are Recursive & Why This is Important has strongly shaped my thinking. The recursiveness of plans is a fact that deserves more attention, both when thinking about our own lives and from an AI alignment standpoint. The AI alignment problem seems to be a problem inherent to pursuing sufficiently complex goals.

Any sufficiently complex goal, in order to be achieved, needs to be broken down recursively into sub-goals until you reach executable actions. E.g. “becoming rich” is not an executable action, and neither is “leaving the maximum copies of your genes.” No one ever *does* those things. The same goes for values: what does it mean to value art, to value meaning, to value life? Whatever it is that you want, you need to translate it into muscle movements and/or lines of code in order to get anywhere close at all to achieving it. This is an image Ruby used in that post:

Now, imagine that you have a really complex goal in that upper node. Say you want to “create the best civilization possible” or “maximize happiness in the universe.” Or you just have a less complex goal, like “having a good and meaningful life,” but you’re in a really complex environment, like the modern capitalist economy. A tree representing that goal, along with the sub-goals and actions it needs to be decomposed into, would be very tall. I can’t make my own image because I am on a borrowed computer, but imagine the tree above with, like, 20 levels of nodes.
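To make that concrete, here is a minimal sketch in Python (the goal, the sub-goals, and the depth are all invented for illustration) of what such a tree looks like, and of the fact that only the leaves are things anyone ever actually *does*:

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    """A node in a plan tree: either it has sub-goals, or it is an executable action (a leaf)."""
    name: str
    subgoals: list["Goal"] = field(default_factory=list)

    def is_executable(self) -> bool:
        # Only leaves correspond to things an agent can actually do.
        return not self.subgoals

# A toy, hypothetical decomposition of a complex goal.
plan = Goal("have a good and meaningful life", [
    Goal("become financially secure", [
        Goal("become a lawyer", [
            Goal("apply to law school", [
                Goal("write application essay"),   # executable
                Goal("request transcripts"),       # executable
            ]),
        ]),
    ]),
    Goal("maintain close relationships", [
        Goal("call a friend this week"),           # executable
    ]),
])

def executable_actions(goal: Goal) -> list[str]:
    """Walk the tree and collect the leaves: the only things that ever get done."""
    if goal.is_executable():
        return [goal.name]
    actions: list[str] = []
    for sub in goal.subgoals:
        actions.extend(executable_actions(sub))
    return actions

print(executable_actions(plan))
# ['write application essay', 'request transcripts', 'call a friend this week']
```

A real tree for a goal like “maximize happiness in the universe” would be enormously deeper and wider than this toy one, which is exactly where the trouble starts.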

Now, our cognitive resources are not infinite, and they don’t scale with the complexity of the environment. You wouldn’t be able to store that entire tree in your mind at all times and keep track of how your every action is serving your one ultimate terminal goal. So in order to act at all, in order to get anywhere, you need to forget a considerable part of your super-goals and focus on the more achievable sub-goals: you need to “zoom in.”

Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal. If you’re thinking of Y all the time, you’re simply not going to do the best you can to get X. Someone who sees becoming a lawyer as their terminal goal, someone who is intrinsically motivated by it, will probably do much better at becoming a lawyer than someone who sees it as merely a step towards something else. That is analogous to how an AI trained to do X will outcompete an AI trained to do X, *plus* value human lives and meaning and everything.

Importantly, this happens in humans at a very visceral level. Motivation, desire, and wanting are not infinite resources. If there is something that is, theoretically, your one true value/desire/goal, you’re not necessarily going to feel any motivation at all to pursue it if you have lower-down sub-goals to achieve first, even if what originally made you select those sub-goals was that they brought you closer to that super-goal.

That may be why we sometimes find ourselves unsure about what we want in life. It may also be why we disagree on what values should guide society. Our motivation needs to be directed at something concrete and actionable in order for us to get anywhere at all.

So the crux of the issue is that we need to manage to 1) determine the sequence of sub-goals and executable actions that will lead to our terminal goal being achieved, and 2) make sure that those sub-goals and executable actions remain aligned with the terminal goal.

There are many examples of that going wrong. Evolution “wants” us to “leave the maximum copies of our genes.” The executable actions that this comes down to are things like “being attracted to specific features in the opposite sex that, in the environment of evolutionary adaptedness, correlated with leaving the maximum copies of your genes” and “having sex.” Nowadays, of course, having sex no longer reliably leads to spreading genes, so evolution is kind of failing at the human alignment problem.

Another example would be people who work their entire lives to succeed at a specific prestigious profession and get a lot of money, but when they do, they end up not being entirely sure of why they wanted that in the first place, and find themselves unhappy.

You can see humans maximizing for, e.g., pictures of feminine curves on Instagram as kind of like an AI maximizing paperclips. Some people think of the paperclip maximizer thought experiment as weird or arbitrary, but it really isn’t any different from what we already do as humans. From what we have to do.

What AI does, in my view, is massively scale and exacerbate that already-existing issue. AI maximizes efficiency, maximizes our capacity to get what we want, and because of that, specifying what we want becomes the trickiest issue.

Goodhart’s law says that “When a measure becomes a target, it ceases to be a good measure.” But every action we take is a movement towards a target! And complex goals won’t work as targets to be aimed at directly without the extensive use of proxies. The role of human language is largely to impress other humans. So when we say that we value happiness, meaning, a great civilization, or whatever, that sounds impressive and cool to other humans, but it says very little about what muscle movements need to be made or what lines of code need to be written.
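To get a feel for how quickly a proxy comes apart from the goal, here is a toy sketch (the distributions and numbers are invented; this is the mildest, purely statistical version of Goodhart, with no agent gaming anything) of what happens when you select hard on a measure that merely correlates with what you care about:

```python
import random

random.seed(0)

def sample_option():
    """One candidate action/outcome: its true quality, plus the noisy proxy we can measure."""
    true_quality = random.gauss(0, 1)
    measurement_error = random.gauss(0, 1)
    return true_quality, true_quality + measurement_error

# Apply increasing optimization pressure: search more options, pick the best by the proxy.
for n_options in (1, 10, 1_000, 100_000):
    options = [sample_option() for _ in range(n_options)]
    best_true, best_proxy = max(options, key=lambda o: o[1])
    print(f"searched {n_options:>7,} options: "
          f"winner's proxy score = {best_proxy:5.2f}, "
          f"winner's true quality = {best_true:5.2f}")
```

The winner’s proxy score keeps climbing as you search harder, but a growing share of it is measurement error rather than the thing you actually cared about: the measure made a fine descriptor and a poor target.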