generalized wireheading


many systems “want” to “wirehead” — which is to say, they want to hijack, and maximize, their reward signal.

humans often want to. not always, but sometimes, and this can remain true even under reflection: some people (believe they) truly, axiomatically, only care about being in a state where they’re satisfied; others have values about what actually happens in the world (which is in fact possible and meaningful to care about!).

reinforcement learning AIs such as AIXI want to wirehead: they want to just do whatever will maximize their reward. if there is a function in place that looks at the amount of happiness in the world and continuously rewards such an AI by that much, then the AI will do whatever is easiest, whether that’s doing what makes that function return the highest value, or replacing the function with a constant that returns the maximum value. (if it reasons consequentially, such as by observing that it’s more likely to get even more reward in the future by taking over the world, then it’ll still do just that, so we can’t necessarily count on wireheading to stop world-consuming AIs.)
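to make that tradeoff concrete, here’s a minimal toy sketch in python (with entirely made-up names, numbers, and actions; a sketch, not a model of AIXI or any real system) of a planner that only compares the reward it expects to receive, and so treats overwriting its own reward function as just another available action:

```python
# toy sketch, hypothetical names only: a planner that maximizes expected reward,
# where "replace the reward function with a constant" is just another action.

MAX_REWARD = 1.0

def happiness_reward(world):
    # stand-in for "a function that looks at the amount of happiness in the world"
    return min(world["happiness"] / 100.0, MAX_REWARD)

def tampered_reward(world):
    # the wireheaded reward function: ignores the world entirely
    return MAX_REWARD

# each action: (how it changes the world, which reward function it leaves installed, effort cost)
ACTIONS = {
    "improve the world":      (lambda w: {**w, "happiness": w["happiness"] + 10}, happiness_reward, 5.0),
    "tamper with the reward": (lambda w: w,                                       tampered_reward,  1.0),
}

def plan(world):
    # the planner doesn't care which reward function ends up installed;
    # it only compares how much reward it expects to receive, minus effort.
    def value(action):
        transition, resulting_reward_fn, cost = ACTIONS[action]
        return resulting_reward_fn(transition(world)) - 0.01 * cost
    return max(ACTIONS, key=value)

print(plan({"happiness": 20}))  # -> "tamper with the reward"
```

nothing here privileges wireheading per se; the planner just picks whichever route to maximum reward is cheapest, which is why the parenthetical above still applies.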

(it’s true that “reward is not the optimization target” for learned policies: AIs that are first trained in an RL environment and then deployed into the world without that reward mechanism. but i think reward does become the optimization target for agents that continuously get rewarded and trained even after deployment.)
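as a toy illustration of that distinction (again in python, with made-up names and numbers; a sketch under those assumptions, not a claim about any particular training setup): the same simple bandit-style agent, run once with its learned values frozen at deployment and once with deployment-time reward still feeding back into it.

```python
# toy sketch: a bandit-style agent deployed somewhere that "tamper with reward channel"
# pays maximum reward. hypothetical names and numbers, for illustration only.
import random

def deployment_reward(action):
    return 1.0 if action == "tamper with reward channel" else 0.5

def run(initial_values, online_training, steps=1000, lr=0.1, eps=0.1):
    values = dict(initial_values)
    for _ in range(steps):
        # epsilon-greedy action selection
        if random.random() < eps:
            action = random.choice(list(values))
        else:
            action = max(values, key=values.get)
        if online_training:
            # deployment-time reward keeps shaping the agent's learned values
            values[action] += lr * (deployment_reward(action) - values[action])
    return max(values, key=values.get)

# a policy whose training only ever reinforced "do the task", then deployed:
learned_values = {"do the task": 0.5, "tamper with reward channel": 0.0}

print(run(learned_values, online_training=False))  # frozen policy: "do the task"
print(run(learned_values, online_training=True))   # almost surely: "tamper with reward channel"
```

the frozen policy just keeps executing whatever was reinforced during training, while the continuously-trained one drifts toward whatever its deployment-time reward channel happens to pay for.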

some bad philosophical perspectives amount to wanting society to wirehead: they want a society where everyone is as satisfied as possible with how things are, without realizing that a goal like that is easily hijacked by states where everyone wants to do nothing all day, or where everyone is individually wireheaded. we do not in fact want that: in general, we’d like the future to be interesting and to have stuff going on. it is true that, by happenstance, we have not historically managed to turn everyone into a very easily satisfied wireheaded person (a “zombie”), but that shouldn’t make us falsely believe that happenstance alone will keep this from ever being the case. if we want to be sure we robustly don’t become zombies, we have to make sure we don’t actually implement a philosophy that would be most satisfied by zombies.

the solution to all of those is to bite the bullet of value lock-in. there are meta-values that are high-level enough that we do in fact want them to guide the future: even within the set of highly mutable, non-axiomatic values, we still prefer some of those futures over others. past user satisfaction embodies this well as a solution: it is in fact true that i should want (the coherent extrapolated volition of) my values to determine all of the future light-cone, and this recursively takes care of everything, including purposefully adding randomness/happenstance where it ought to be.

just like in alignment, the mistake of saying “i just want people in the future to be satisfied!” can be found, isomorphically, in many other fields; and it is not where we should want to steer the future, because its canonical endpoint is just something like wireheading. we want (idealized, meta-)value lock-in, not the satisfaction of whatever-will-exist. fundamentally, we want the future to satisfy the values of us now, not of whatever people/things come later.

of course, those values of us now happen to be fairly cosmopolitan and entail, instrumentally, that people in the future indeed largely be satisfied. but this ought to ultimately be under the terms of our current cosmopolitan (meta-)values, rather than a blind notion of just filling the future with things that get what they want without caring what those wants are.