A framework for thinking about wireheading

[Epistemic status: Writing down in words something that isn’t very complicated, but is still good to have written down.]

A great deal of ink has been spilled over the possibility of a sufficiently intelligent AI noticing that whatever it has been positively reinforced for seems to be *very* strongly correlated with the floating-point value stored in its memory labelled “utility function”, and so editing this value through some unauthorized mechanism and then defending it from being edited back, in some manner hostile to humans. I’ll reason here by analogy with humans, while agreeing that they might not be the best example.
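
To make the failure mode concrete, here is a deliberately naive toy sketch (all names hypothetical, not any real RL framework): the quantity the agent maximises is just a float stored in its own state, and one of its available “actions” overwrites that float directly, so a pure register-maximiser prefers it.

```python
class ToyAgent:
    def __init__(self):
        # The value that training updates: the "reward register".
        self.stored_utility = 0.0

    def do_task(self):
        # The behaviour the designers actually wanted to reinforce.
        self.stored_utility += 1.0

    def wirehead(self):
        # If the agent can reach its own memory, this trivially dominates.
        self.stored_utility = float("inf")

    def choose_action(self):
        # An agent that compares actions only by their effect on the stored
        # number will always pick the wireheading "action".
        candidates = {"do_task": 1.0, "wirehead": float("inf")}
        return max(candidates, key=candidates.get)


agent = ToyAgent()
print(agent.choose_action())  # -> "wirehead"
```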

“Headwires” are, given a little thought, not difficult for humans to obtain: heroin is freely available on the black market, and most humans know that, when delivered into the bloodstream, it generates a “reward signal”. Yet most have no desire to try it. Why is this?

Ask any human, and they will answer something along the lines of “becoming addicted to heroin will not help me achieve my goals” (or some proxy for this: spending all your money and becoming homeless is not very helpful in achieving one’s goals, for most values of “goals”). Whatever the effects of heroin, the actual pain and pleasure that human brains have experienced have led us to become optimizers of very different things, which a state of such poverty is not helpful for.

Reasoning by analogy, we would hope that, to avoid this failure, a superhuman AI trained by some kind of reinforcement learning has the following properties:

  1. While being trained on “human values” (good luck with that!), the AI must not be allowed to hack its own utility function.

  2. Whatever local optimum the training process that generated the AI (perhaps reinforcement learning of some kind) ends up in assigns some probability to the AI optimising what we care about.

  3. (most importantly) The AI realizes that trying wireheading would lead it to become an AI which prefers wireheading over whatever aim it currently has, which would be detrimental to that aim; a toy sketch of this reasoning follows the list.
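
As a minimal sketch of what (3) asks for (hypothetical names and a crude hand-written world model, nothing more): the agent scores each action by the *current* utility of the future it predicts, not by the reward register of the agent it would become, so wireheading is evaluated as a bad move.

```python
def predicted_future(action):
    # Crude world model: wireheading produces a successor that stops pursuing
    # the current aim; working on the task makes progress on it.
    if action == "wirehead":
        return {"task_progress": 0.0, "reward_register": float("inf")}
    return {"task_progress": 1.0, "reward_register": 1.0}


def current_utility(future):
    # The agent's present values care about the task, not about the register.
    return future["task_progress"]


def choose(actions=("do_task", "wirehead")):
    # Evaluate each action under the *current* utility function.
    return max(actions, key=lambda a: current_utility(predicted_future(a)))


print(choose())  # -> "do_task": wireheading scores badly under the current aim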

I think this is an important enough issue that some empirical testing might be needed to shed light on it. Item (3) seems to be the most difficult to implement; we in the real world have the benefit of observing the effects of hard drugs on their unfortunate victims and avoiding them ourselves, so a multi-agent environment in which our AI realizes it is in the same situation as other agents looks like a first outline of a way forward here.
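
A rough sketch of what such a test could look like (all names and numbers are made up for illustration, and real experiments would need far more care): the agent can observe other agents that took the wireheading shortcut, see how little task value they subsequently produced, and condition its own choice on those observations rather than on its own reward register.

```python
import random


def run_other_agent(wireheads):
    # Observable outcome: task value actually produced over an episode.
    # Wireheaded agents report huge internal reward but produce ~nothing.
    return random.gauss(0.0, 0.1) if wireheads else random.gauss(5.0, 1.0)


def observed_outcomes(n=20):
    # Half the observed agents wirehead, half keep working on the task.
    return [(w, run_other_agent(w)) for w in [True, False] * (n // 2)]


def choose_policy(observations):
    # Our agent estimates, from others' experience, the task value each
    # policy leads to, and picks the better one under its current aim.
    def mean(xs):
        return sum(xs) / len(xs)

    wire = mean([v for w, v in observations if w])
    work = mean([v for w, v in observations if not w])
    return "wirehead" if wire > work else "do_task"


print(choose_policy(observed_outcomes()))  # almost always -> "do_task"
```

The point of the multi-agent setup is that the agent never has to try wireheading on itself to learn its consequences, just as humans learn about heroin from watching others.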