Goal retention discussion with Eliezer

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a much-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

Eliezer Yud­kowsky just sent the fol­low­ing ex­tremely in­ter­est­ing com­ments, and told me he was OK with me shar­ing them here to spur a broader dis­cus­sion of these is­sues, so here goes.

On Sep 3, 2014, at 17:21, Eliezer Yud­kowsky <yud­kowsky@gmail.com> wrote:

Hi Max! You’re ask­ing the right ques­tions. Some of the an­swers we can­
give you, some we can’t, few have been writ­ten up and even fewer in any­
well-or­ga­nized way. Benja or Nate might be able to ex­pound in more de­tail
while I’m in my seclu­sion.

Very briefly, though:
The prob­lem of util­ity func­tions turn­ing out to be ill-defined in light of
new dis­cov­er­ies of the uni­verse is what Peter de Blanc named an
“ontological crisis” (not necessarily a particularly good name, but it’s
what we’ve been us­ing lo­cally).


The way I would phrase this prob­lem now is that an ex­pected util­i­ty­
max­i­mizer makes com­par­i­sons be­tween quan­tities that have the type
“expected utility conditional on an action”, which means that the AI’s
u­til­ity func­tion must be some­thing that can as­sign util­ity-num­bers to the
AI’s model of re­al­ity, and these num­bers must have the fur­ther prop­er­ty
that there is some com­pu­ta­tion­ally fea­si­ble ap­prox­i­ma­tion for calcu­lat­ing­
ex­pected util­ities rel­a­tive to the AI’s prob­a­bil­is­tic be­liefs. This is a
con­straint that rules out the vast ma­jor­ity of all com­pletely chaotic an­d
un­in­ter­est­ing util­ity func­tions, but does not rule out, say, “make lots of­

Models also have the prop­erty of be­ing Bayes-up­dated us­ing sen­so­ry­
in­for­ma­tion; for the sake of dis­cus­sion let’s also say that mod­els are­
about uni­verses that can gen­er­ate sen­sory in­for­ma­tion, so that the­se­
mod­els can be prob­a­bil­is­ti­cally falsified or con­firmed. Then an
“ontological crisis” occurs when the hypothesis that best fits sensory
in­for­ma­tion cor­re­sponds to a model that the util­ity func­tion doesn’t run
on, or doesn’t de­tect any util­ity-hav­ing ob­jects in. The ex­am­ple of
“immortal souls” is a reasonable one. Suppose we had an AI that had a
nat­u­ral­is­tic ver­sion of a Solomonoff prior, a lan­guage for spec­i­fy­in­g
u­ni­verses that could have pro­duced its sen­sory data. Sup­pose we tried to­
give it a util­ity func­tion that would look through any given model, de­tect­
things cor­re­spond­ing to im­mor­tal souls, and value those things. Even if
the im­mor­tal-soul-de­tect­ing util­ity func­tion works perfectly (it would in­
fact de­tect all im­mor­tal souls) this util­ity func­tion will not de­tec­t
any­thing in many (rep­re­sen­ta­tions of) uni­verses, and in par­tic­u­lar it will
not de­tect any­thing in the (rep­re­sen­ta­tions of) uni­verses we think have­
most of the prob­a­bil­ity mass for ex­plain­ing our own world. In this case
the AI’s be­hav­ior is un­defined un­til you tell me more things about the AI;
an ob­vi­ous pos­si­bil­ity is that the AI would choose most of its ac­tions­
based on low-prob­a­bil­ity sce­nar­ios in which hid­den im­mor­tal souls ex­ist­ed
that its ac­tions could af­fect. (Note that even in this case the util­i­ty
func­tion is sta­ble!)
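
As a toy illustration of the immortal-souls case (a minimal sketch, not part
of the original email; the world names, probabilities, soul counts, and
action names are all made-up assumptions), the following Python snippet
builds an expected utility maximizer whose fixed utility function only
values souls. Almost all probability mass sits on a model with no souls, so
the choice is driven entirely by the low-probability model in which souls
exist.

```python
from dataclasses import dataclass

# Toy illustration only: every number and name below is hypothetical.

@dataclass(frozen=True)
class World:
    name: str
    probability: float  # the AI's credence in this model of reality
    souls: int          # immortal souls the utility function detects in it

def soul_utility(world, action):
    """Fixed utility function: it only values souls the action protects."""
    return float(world.souls) if action == "protect_souls" else 0.0

def expected_utility(action, worlds):
    """Expected utility conditional on an action, under the AI's beliefs."""
    return sum(w.probability * soul_utility(w, action) for w in worlds)

def choose(actions, worlds):
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, worlds))

worlds = [
    World("naturalistic physics", 0.999, 0),
    World("hidden immortal souls", 0.001, 1000),
]
actions = ["do_nothing", "protect_souls"]

# The soul-free model contributes nothing to any comparison between actions.
naturalistic_only = [w for w in worlds if w.souls == 0]
tied = {a: expected_utility(a, naturalistic_only) for a in actions}

best = choose(actions, worlds)
```

Here `tied` assigns zero to every action, which is the sense in which
behavior is underdetermined by the high-probability model, while `best` is
decided by the 0.001-probability world. Note that `soul_utility` itself is
never modified.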

Since we don’t know the fi­nal laws of physics and could eas­ily be­
sur­prised by fur­ther dis­cov­er­ies in the laws of physics, it seems pret­ty­
clear that we shouldn’t be spec­i­fy­ing a util­ity func­tion over ex­act­
phys­i­cal states rel­a­tive to the Stan­dard Model, be­cause if the Stan­dard­
Model is even slightly wrong we get an on­tolog­i­cal crisis. Of course
there are all sorts of ex­tremely good rea­sons we should not try to do this
any­way, some of which are touched on in your draft; there just is no
sim­ple func­tion of physics that gives us some­thing good to max­i­mize. See
also Com­plex­ity of Value, Frag­ility of Value, in­di­rect nor­ma­tivity, the­
w­hole rea­son for a drive be­hind CEV, and so on. We’re al­most cer­tain­ly­
go­ing to be us­ing some sort of util­ity-learn­ing al­gorithm, the learne­d
u­til­ities are go­ing to bind to mod­eled fi­nal physics by way of mod­eled­
higher lev­els of rep­re­sen­ta­tion which are known to be im­perfect, and we’re­
go­ing to have to figure out how to pre­serve the model and learne­d
u­til­ities through shifts of rep­re­sen­ta­tion. E.g., the AI dis­cov­ers that
hu­mans are made of atoms rather than be­ing on­tolog­i­cally fun­da­men­tal
hu­mans, and fur­ther­more the AI’s multi-level rep­re­sen­ta­tions of re­al­i­ty
e­volve to use a differ­ent sort of ap­prox­i­ma­tion for “hu­mans”, but that’s
okay be­cause our util­ity-learn­ing mechanism also says how to re-bind the­
learned in­for­ma­tion through an on­tolog­i­cal shift.
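
One way to picture the re-binding idea is the toy Python sketch below. It is
only an illustration under strong assumptions: the two ontologies, the
`bind_folk` and `bind_atomic` functions, and the "happy_humans" feature are
all hypothetical, and the genuinely hard part (how a learning mechanism
would produce the new binding) is simply assumed to have been solved.

```python
# Hypothetical sketch: both ontologies and both binding functions are invented.

def learned_utility(features):
    """Learned utility, defined over high-level features rather than raw physics."""
    return features["happy_humans"]

def bind_folk(state):
    """Old ontology: humans are primitive objects with a mood attribute."""
    count = sum(1 for h in state["humans"] if h["mood"] == "happy")
    return {"happy_humans": count}

def bind_atomic(state):
    """New ontology: only atom clusters exist. A cluster counts as a happy
    human if it is human-shaped and its (toy) pattern is tagged happy."""
    count = sum(
        1
        for c in state["clusters"]
        if c["shape"] == "human" and c["pattern"] == "happy"
    )
    return {"happy_humans": count}

def utility(state, binding):
    """Evaluate the same learned utility through whichever binding fits the model."""
    return learned_utility(binding(state))

folk_state = {"humans": [{"mood": "happy"}, {"mood": "sad"}]}
atomic_state = {
    "clusters": [
        {"shape": "human", "pattern": "happy"},
        {"shape": "human", "pattern": "sad"},
        {"shape": "rock", "pattern": None},
    ]
}

before_shift = utility(folk_state, bind_folk)
after_shift = utility(atomic_state, bind_atomic)
```

The learned utility is untouched by the shift; only the binding from the
low-level model to the high-level features is swapped.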

This sorta thing ain’t go­ing to be easy which is the other big rea­son to
start work­ing on it well in ad­vance. I point out how­ever that this­
doesn’t seem un­think­able in hu­man terms. We dis­cov­ered that brains are­
made of neu­rons but were nonethe­less able to main­tain an in­tu­itive grasp
on what it means for them to be happy, and we don’t throw away all that­
info each time a new phys­i­cal dis­cov­ery is made. The kind of cog­ni­tion we­
want does not seem in­her­ently self-con­tra­dic­tory.

Three other quick re­marks:

*) The Omohundrian/Yudkowskian argument is not that we can take an arbitrary
s­tupid young AI and it will be smart enough to self-mod­ify in a way that­
p­re­serves its val­ues, but rather that most AIs that don’t self-de­struc­t
will even­tu­ally end up at a sta­ble fixed-point of co­her­ent­
con­se­quen­tial­ist val­ues. This could eas­ily in­volve a step where, e.g., an
AI that started out with a neu­ral-style delta-rule policy-re­in­force­ment
learn­ing al­gorithm, or an AI that started out as a big soup of­
self-mod­ify­ing heuris­tics, is “taken over” by what­ever part of the AI
first learns to do con­se­quen­tial­ist rea­son­ing about code. But this
pro­cess doesn’t re­peat in­definitely; it sta­bi­lizes when there’s a
con­se­quen­tial­ist self-mod­ifier with a co­her­ent util­ity func­tion that can­
precisely pre­dict the re­sults of self-mod­ifi­ca­tions. The part where this­
does hap­pen to an ini­tial AI that is un­der this thresh­old of sta­bil­ity is
a big part of the prob­lem of Friendly AI and it’s why MIRI works on tilin­g
a­gents and so on!

*) Nat­u­ral se­lec­tion is not a con­se­quen­tial­ist, nor is it the sort of­
con­se­quen­tial­ist that can suffi­ciently pre­cisely pre­dict the re­sults of­
mod­ifi­ca­tions that the ba­sic ar­gu­ment should go through for its sta­bil­ity.
It built hu­mans to be con­se­quen­tial­ists that would value sex, not val­ue
in­clu­sive ge­netic fit­ness, and not value be­ing faith­ful to nat­u­ral
s­e­lec­tion’s op­ti­miza­tion crite­rion. Well, that’s dumb, and of course the
re­sult is that hu­mans don’t op­ti­mize for in­clu­sive ge­netic fit­ness.
Nat­u­ral se­lec­tion was just stupid like that. But that doesn’t mean­
there’s a generic pro­cess whereby an agent re­jects its “pur­pose” in the­
light of ex­oge­nously ap­pear­ing prefer­ence crite­ria. Nat­u­ral se­lec­tion’s
an­thro­po­mor­phized “pur­pose” in mak­ing hu­man brains is just not the same as­
the cog­ni­tive pur­poses rep­re­sented in those brains. We’re not talk­ing
about spon­ta­neous re­jec­tion of in­ter­nal cog­ni­tive pur­poses based on their­
causal ori­gins failing to meet some ex­oge­nously-ma­te­ri­al­iz­ing crite­rion of­
val­idity. Our re­jec­tion of “max­i­mize in­clu­sive ge­netic fit­ness” is not an
ex­oge­nous re­jec­tion of some­thing that was ex­plic­itly rep­re­sented in us,
that we were ex­plic­itly be­ing con­se­quen­tial­ists for. It’s a re­jec­tion of­
some­thing that was never an ex­plic­itly rep­re­sented ter­mi­nal value in the
first place. Similarly the sta­bil­ity ar­gu­ment for suffi­ciently ad­vanced­
self-mod­ifiers doesn’t go through a step where the suc­ces­sor form of the
AI rea­sons about the in­ten­tions of the pre­vi­ous step and re­spects them
a­part from its con­structed util­ity func­tion. So the lack of any uni­ver­sal
prefer­ence of this sort is not a gen­eral ob­sta­cle to sta­ble­

*) The case of nat­u­ral se­lec­tion does not illus­trate a uni­ver­sal­
com­pu­ta­tional con­straint, it illus­trates some­thing that we coul­d
an­thro­po­mor­phize as a fool­ish de­sign er­ror. Con­sider hu­mans build­ing Deep­
Blue. We built Deep Blue to at­tach a sort of de­fault value to queens and­
cen­tral con­trol in its po­si­tion eval­u­a­tion func­tion, but Deep Blue is
still perfectly able to sac­ri­fice queens and cen­tral con­trol al­ike if the­
p­o­si­tion reaches a check­mate thereby. In other words, al­though an agent­
needs crys­tal­lized in­stru­men­tal goals, it is also perfectly rea­son­able to­
have an agent which never know­ingly sac­ri­fices the ter­mi­nally define­d
u­til­ities for the crys­tal­lized in­stru­men­tal goals if the two con­flict;
in­deed “in­stru­men­tal value of X” is sim­ply “prob­a­bil­is­tic be­lief that X
leads to ter­mi­nal util­ity achieve­ment”, which is sen­si­bly re­vised in the
p­res­ence of any over­rid­ing in­for­ma­tion about the ter­mi­nal util­ity. To put
it an­other way, in a ra­tio­nal agent, the only way a loose gen­er­al­iza­tion­
about in­stru­men­tal ex­pected-value can con­flict with and trump ter­mi­nal
ac­tual-value is if the agent doesn’t know it, i.e., it does some­thing that
it rea­son­ably ex­pected to lead to ter­mi­nal value, but it was wrong.
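
The Deep Blue point can be sketched in a few lines of Python. This is a toy
under made-up assumptions, not a description of how Deep Blue actually
works: the constants, move names, and outcome dictionaries are hypothetical.
Material value acts as the crystallized instrumental goal, and it is only
consulted when the terminal outcome is not already known.

```python
# Toy numbers and move names; not Deep Blue's real evaluation function.
CHECKMATE = 1000.0  # terminal utility of winning
QUEEN = 9.0         # crystallized instrumental value of a queen

def evaluate(outcome):
    """Known terminal results override the material heuristic."""
    if outcome["checkmate"]:
        return CHECKMATE
    # Heuristic stand-in for "probably leads to terminal utility".
    return outcome["material"]

def best_move(moves):
    """Return the move whose predicted outcome scores highest."""
    return max(moves, key=lambda m: evaluate(moves[m]))

# When search sees a forced mate, the queen is given up without hesitation.
with_mate = {
    "keep_queen": {"checkmate": False, "material": QUEEN},
    "sacrifice_queen_for_mate": {"checkmate": True, "material": -QUEEN},
}

# With no terminal information, the instrumental heuristic decides.
without_mate = {
    "keep_queen": {"checkmate": False, "material": QUEEN},
    "lose_queen": {"checkmate": False, "material": -QUEEN},
}

choice_with_mate = best_move(with_mate)
choice_without_mate = best_move(without_mate)
```

In this sketch the heuristic can only "trump" the terminal value when the
agent's predicted outcome is wrong, which matches the claim above.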

This has been very off-the-cuff and I think I should hand this over to
Nate or Benja if fur­ther replies are needed, if that’s all right.