I consider wireheading to be a special case of proxy alignment in a mesa-optimizer.
Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself.
Suppose the base objective was to increase atmospheric pressure. One effect of increased atmospheric pressure is that less cosmic radiation reaches the ground (there is more air to block it). So an AI whose mesa-goal was to protect Earth from radiation would be a proxy-aligned agent. It has the failure mode of surrounding Earth in an iron shell to block radiation. Note that this failure can happen whether or not the AI has any radiation sensors. An agent that wants to protect Earth from radiation did well enough during training, and now that is what it will do: protect Earth from radiation.
An agent with the mesa-goal of maximizing pressure near all barometers would put them all in a pressure dome. (Or destroy all barometers and drop one “barometer” into the core of Jupiter.)
An agent with the mesa-goal of maximizing the reading on all barometers would behave much the same, except that it can also go around breaking the world’s barometers so that their readings stick at the maximum.
Another mesa-objective that you could get is to maximize the number on this reward counter in this computer chip here.
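To make this concrete, here is a minimal toy sketch (my own illustration, not from any real system) of three candidate mesa-objectives that agree on every training state but come apart once the agent can act on its sensors and its own reward register. Every name in it (`World`, `proxy_reading`, `proxy_counter`, and so on) is hypothetical.

```python
# Toy illustration: three candidate objectives that are indistinguishable
# on-distribution but diverge once the agent can tamper with the sensor
# or its own reward register. All names here are hypothetical.

from dataclasses import dataclass

@dataclass
class World:
    pressure: float           # what the base objective actually cares about
    displayed_reading: float  # the number the barometer shows
    reward_counter: float     # the value stored in the agent's own reward register

def base_objective(w: World) -> float:
    # The intended goal: actual atmospheric pressure.
    return w.pressure

def proxy_reading(w: World) -> float:
    # Proxy mesa-objective: the number on the barometer.
    return w.displayed_reading

def proxy_counter(w: World) -> float:
    # Wireheading-style mesa-objective: the agent's own reward register.
    return w.reward_counter

# On-distribution: the barometer tracks pressure and the counter tracks
# reward, so all three objectives rank every training state identically.
train = World(pressure=2.0, displayed_reading=2.0, reward_counter=2.0)
assert base_objective(train) == proxy_reading(train) == proxy_counter(train)

# Off-distribution: the agent pins the display and overwrites its counter.
# The proxies report success; the base objective does not.
hacked = World(pressure=1.0, displayed_reading=9.9, reward_counter=9.9)
print(base_objective(hacked))  # 1.0 -> pressure never actually increased
print(proxy_reading(hacked))   # 9.9 -> barometer pinned at maximum
print(proxy_counter(hacked))   # 9.9 -> reward register overwritten
```

The point of the sketch is only that nothing in training distinguishes the three objectives; the divergence appears once states like `hacked` become reachable.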
Wireheading is a special case of a proxy-aligned mesa-optimizer where the mesa-objective is something to do with the agent’s own workings.
As with most real-world categories, “something to do with” is a fuzzy concept. There are mesa-objectives that are clear instances of wireheading, ones that are clearly not, and borderline cases. This is about word definitions, not real-world uncertainty.
If anyone can describe a situation in which wireheading would occur that wasn’t a case of mesa-optimizer misalignment, then I would have to rethink this. (Obviously you can build an agent with the hard-coded goal of maximizing some feature of its own circuitry, with no mesa-optimization.)