A framework for thinking about wireheading

[Epistemic status: Writing down in words something that isn’t very complicated, but is still good to have written down.]

A great deal of ink has been spilled over the possibility of a sufficiently intelligent AI noticing that whatever it has been positively reinforced for is *very* strongly correlated with the floating-point value stored in its memory labelled “utility function”, editing this value through some unauthorized mechanism, and then defending it from being edited back in some manner hostile to humans. I’ll reason here by analogy with humans, while granting that they might not be the best example.

“Headwires” are (given a little thought) not difficult for humans to obtain: heroin is freely available on the black market, and most humans know that, when delivered into the bloodstream, it generates “reward signal”. Yet most have no desire to try it. Why is this?

Ask any human and they will answer something along the lines of “becoming addicted to heroin will not help me achieve my goals” (or some proxy for this: spending all your money and becoming homeless is not very helpful in achieving one’s goals, for most values of “goals”). Whatever the effects of heroin, the actual pain and pleasure that human brains have experienced has led us to become optimizers of very different things, and a state of such poverty does not help with them.

Reasoning analogously about AI, we would hope that, to avoid this, a superhuman AI trained by some kind of reinforcement learning has the following properties:

  1. While being trained on “human values” (good luck with that!), the AI must not be allowed to hack its own utility function.

  2. Whatever local optimum the training process that generated the AI ends up in (perhaps reinforcement learning of some kind) assigns some probability to the AI optimizing what we care about.

  3. (Most importantly) The AI realizes that trying wireheading would turn it into an AI that prefers wireheading over the aim it currently has, which would be detrimental to that aim.
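Property (3) can be made concrete with a toy sketch: a planner that scores candidate futures with its *current* utility function, rather than with whatever reward register the future agent would report. Everything here (the “paperclips” objective, the action names) is invented for illustration.

```python
def current_utility(world_state):
    # The agent's present goal: maximize paperclips in the world state.
    return world_state["paperclips"]

def predict_outcome(world_state, action):
    """Predict the future world state resulting from an action."""
    state = dict(world_state)
    if action == "work":
        state["paperclips"] += 10
    elif action == "wirehead":
        # The reward register gets pinned at maximum, but the future agent
        # stops making paperclips -- it only defends its reward register.
        state["reward_register"] = float("inf")
    return state

def choose_action(world_state, actions):
    # Crucially, futures are scored by the *current* utility function,
    # not by the reward register the future agent would report.
    return max(actions,
               key=lambda a: current_utility(predict_outcome(world_state, a)))

state = {"paperclips": 0, "reward_register": 0.0}
print(choose_action(state, ["work", "wirehead"]))  # -> work
```

The key design choice is in `choose_action`: because the wireheaded future is evaluated by the present utility function, pinning the reward register at infinity buys that future nothing.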

I think this is an important enough issue that some empirical testing might be needed to shed light on it. Item (3) seems the most difficult to implement; we in the real world have the benefit of observing the effects of hard drugs on their unfortunate victims and avoiding them ourselves, so a multi-agent environment in which our AI realizes it is in the same situation as other agents looks like a first outline of a way forward here.
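The multi-agent idea can be sketched in miniature, with the caveat that every number and name below is invented: peers that wirehead report high internal reward but score zero on the outer objective, and an observer that ranks peers by outer outcome rather than reported reward would conclude that wireheading is a losing strategy for its current aims.

```python
def run_peer(wireheads):
    # A wireheaded peer reports maximal internal reward, but its score on
    # the outer objective (the thing we actually care about) collapses.
    internal_reward = 100.0 if wireheads else 10.0
    outer_score = 0.0 if wireheads else 10.0
    return internal_reward, outer_score

# Half the observed peers wirehead, half do not.
observations = [run_peer(wireheads=bool(i % 2)) for i in range(10)]

# The observer judges peers by outer outcome, not by the reward they report.
best = max(observations, key=lambda obs: obs[1])
print(best)  # -> (10.0, 10.0): the best-off peers are the ones that did not wirehead
```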