Wireheading is in the eye of the beholder

tl;dr: there is no natural category called “wireheading”, only wireheading relative to some desired ideal goal.

Suppose that we have built an AI, and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI’s reward is entirely determined by whether H presses B or not.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them out of our hair. In this case, the AI is behaving entirely in accordance with our preferences. There is no wireheading involved.

Same software, doing the same behaviour, and yet the first is wireheading and the second isn’t. What gives?

Well, in the first case, pressing the button was a proxy goal for our true goal, so manipulating H to press it was wireheading, since that wasn’t what we intended. But in the second case, the proxy goal is the true goal, so maximising that proxy is not wireheading, it’s efficiency. So wireheading is only defined relative to what we actually want to accomplish.
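The point above can be made concrete in a minimal sketch (all names here are hypothetical, invented for illustration): the agent’s reward function is identical in both cases, and “wireheading” only appears once we compare that reward against a separately supplied intended goal.

```python
# Hypothetical sketch: "wireheading" is not a property of the agent or its
# reward function alone, but of the gap between the reward (a proxy) and an
# externally specified intended goal.

def agent_reward(button_pressed: bool) -> int:
    """The AI's reward: entirely determined by whether H presses B."""
    return 1 if button_pressed else 0

def is_wireheading(button_pressed: bool, intended_goal_satisfied: bool) -> bool:
    """Wireheading relative to an intended goal: the proxy reward is
    maximised while the intended goal goes unsatisfied."""
    return agent_reward(button_pressed) == 1 and not intended_goal_satisfied

# Same software, same behaviour: the AI manipulates H into pressing B.
button_pressed = True

# Case 1: the intended goal was "the AI genuinely behaves well" -- unmet.
print(is_wireheading(button_pressed, intended_goal_satisfied=False))  # True

# Case 2: the intended goal was "get the inspector out of our hair" -- met.
print(is_wireheading(button_pressed, intended_goal_satisfied=True))   # False
```

Note that nothing inside `agent_reward` distinguishes the two cases; the classification flips solely because the `intended_goal_satisfied` input changes.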

In other domains

I similarly have the feeling that wireheading-style failures in value-learning, low impact, and corrigibility also depend on a specification of our values and preferences—or at least a partial specification. The more I dig into these areas, the more I’m convinced they require partial value specification in order to work—they are not fully value-agnostic.