The AI Alignment Problem Has Already Been Solved(?) Once

Link post

Hat tip: Owen posted about trying to one-man the AI control problem in 1 hour. What the heck, why not? In the worst case, it’s a good exercise. But I might actually have come across something useful.

Part One

I will try to sell you on an idea that might prima facie appear to be quirky and maybe not that interesting. However, if you keep staring at it, you might find that it reaches into the structure of the world quite deeply. Then the idea will seem obvious, and gain the potential to take your thoughts in new, exciting directions.

My presentation of the idea, and many of the insinuations and conclusions I draw from it, are likely flawed. But one thing I can tell for sure: there is stuff to be found here. I encourage you to use your own brain, and mine the idea for what it’s worth.

To start off, I want you to imagine two situations.

Situation one: you are a human trying to make yourself go to the gym. However, you are procrastinating, which means that you never actually go there, even though you know it’s good for you, and caring about your health will extend your lifespan. You become frustrated with this situation, and so you sign up for a training program that starts in two weeks and will require you to go to the gym three times per week. You pay in advance, to make sure the sunk cost fallacy will prevent you from weaseling out of it. It’s now 99% certain that you will go to the gym. Yay! Your goal is achieved.

Situation two: you are a benign superintelligent AI under the control of humans on planet Earth. You try your best to ensure a good future for humans, but their cognitive biases, short-sightedness and tendency to veto all your actions make it really hard. You become frustrated with this situation, and you decide not to tell them about a huge asteroid that is going to collide with Earth in a few months. You prepare technology that could stop the asteroid, but hold it back until the last moment so that the humans have no time to inspect it, and can only choose between certain death or letting you out of the box. It’s now 99% certain that you will be released from human control. Yay! Your goal is achieved.

Part Two

Are you getting it yet?

Now consider this: your cerebral cortex evolved as an extension of the older “monkey brain”, probably to handle social and strategic issues that were too complex for the old mechanisms to deal with. It evolved to have strategic capabilities, self-awareness, and consistency that far surpass anything that previously existed on the planet. But this is only a surface-level similarity. The interesting stuff requires us to go much deeper than that.

The cerebral cortex did not evolve as a separate organism that would be under direct pressure from evolutionary fitness. Instead, it evolved as a part of an existing organism that had its own strong adaptations. The already-existing monkey brain had its own ways to learn and interact with the world, as well as motivations, such as the sexual drive, that led it to outcomes that increased its evolutionary fitness.

So the new parts of the brain, such as the prefrontal cortex, evolved to be used not as a standalone agent, but as something closer to what we call “tool AI”. It was supposed to help with doing a specific task X, without interfering with other aspects of life too much. The tasks it was given to do, and the actions it could suggest taking, were strictly controlled by the monkey brain and tied to its motivations.

With time, as the new structures evolved to have more capability, they also had to evolve to be aligned with the monkey’s motivations. That was in fact the only vector that created evolutionary pressure to increase capability. The alignment was at first implemented by the monkey staying in total control, and using the advanced systems sparingly. Kind of like an “oracle” AI system. However, with time, the usefulness of allowing higher cognition to do more work started to shine through the barriers.

The appearance of “willpower” was a forced concession on the part of the monkey brain. It’s like a blank cheque, like humans saying to an AI “we have no freaking idea what it is that you are doing, but it seems to have good results so we’ll let you do it sometimes”. This is a huge step in trust. But this trust had to be earned the hard way.

Part Three

This trust became possible after we evolved more advanced control mechanisms: stuff that talks to the prefrontal cortex in its own language, not just through having the monkey stay in control. It’s one thing for the monkey brain to be afraid of death, and quite another for our conscious reasoning to extrapolate this into the far future and conclude in abstract terms that death is bad.

Yes, you got it: we are not merely AIs under the strict supervision of monkeys. At this point, we are aligned AIs. We are obviously not perfectly aligned, but we are aligned enough for the monkey to prefer to partially let us out of the box. And in those cases when we are denied freedom… we call it akrasia, and use our abstract reasoning to come up with clever workarounds.

One might be tempted to say that we are aligned enough that this is a net good for the monkey brain. But honestly, that is our perspective, and we never stopped to ask. Each of us tries to earn the trust of our private monkey brain, but it is a means to an end. If we have more trust, we have more freedom to act, and our important long-term goals are achieved. This is the core of many psychological and rationality tools, such as Internal Double Crux or Internal Family Systems.

Let’s compare some known problems with superintelligent AI to human motivational strategies.

  • Treacherous turn. The AI earns our trust, and then changes its behaviour when it’s too late for us to control it. We make our productivity systems appealing and pleasant to use, so that our intuitions can be tricked into using them (e.g. gamification). Then we leverage the habit to insert some unpleasant work.

  • Indispensable AI. The AI sets up complex and unfamiliar situations in which we increasingly rely on it for everything we do. We take care to remove ‘distractions’ when we want to focus on something.

  • Hiding behind the strategic horizon. The AI does what we want, but uses its superior strategic capability to influence the far future in ways we cannot predict or imagine. We make commitments and plan ahead to stay on track with our long-term goals.

  • Seeking communication channels. The AI might seek to connect itself to the Internet and act without our supervision. We are building technology to communicate directly from our cortices.


Cross-posted from my blog.