Censoring out-of-domain representations

(An idea from a recent MIRI research workshop; similar to some ideas of Eric Drexler and others. Would need further development before it’s clear whether this would do anything interesting, let alone be a reliable part of a taskifying approach.)

If you take an AI capable of pretty general reasoning, and you ask it to do a task in a restricted domain, you can’t necessarily count on it to find a solution that stays conceptually within that domain. This covers a couple of different failure modes, including wireheading (the AI modifies its own sensory inputs rather than the external object you wanted it to modify) and manipulation of humans (the AI influences human psychology rather than the thing you wanted it to directly affect).

Directly forbidding actions outside a domain seems tricky if we can’t define the domain in closed form (especially when the AI affects humans, and we don’t want it to develop models of human psychology sufficient for manipulating them). One thing we could try instead is to ensure that the AI doesn’t “know too much”, in its internal representations, about anything outside its domain.

“Know too much” here can be defined in the sense of “Censoring Representations with an Adversary”. We add to our network one or more adversary networks that try to extract out-of-domain information from its internal representations; by reversing the sign of the gradient flowing back from each adversary into the representation, we train the representation to erase that information.
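
As a concrete, purely illustrative sketch of that setup, assuming a PyTorch-style training loop: the encoder, task head, adversary, their shapes, and the censor_weight coefficient below are placeholder choices of mine, not part of the proposal. A gradient-reversal layer lets one backward pass simultaneously train the adversary to extract out-of-domain information and train the encoder to erase it.

```python
# Minimal sketch of adversarial representation censoring with a gradient-reversal
# layer, roughly in the spirit of "Censoring Representations with an Adversary".
# The encoder/task-head/adversary architectures, shapes, and the out-of-domain
# labels are all illustrative assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates the gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()


def grad_reverse(x):
    return GradReverse.apply(x)


# Hypothetical components: an encoder producing the internal representation z,
# an in-domain task head, and an adversary that tries to recover out-of-domain
# information (e.g. facts about the world outside the box) from z.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
task_head = nn.Linear(32, 10)   # in-domain task prediction
adversary = nn.Linear(32, 5)    # out-of-domain label prediction

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()) + list(adversary.parameters()),
    lr=1e-3,
)


def training_step(x, task_labels, ood_labels, censor_weight=1.0):
    z = encoder(x)
    task_loss = nn.functional.cross_entropy(task_head(z), task_labels)
    # The adversary minimizes this loss (learning to extract out-of-domain info),
    # while the reversed gradient makes the encoder maximize it (erasing that info).
    adv_loss = nn.functional.cross_entropy(adversary(grad_reverse(z)), ood_labels)
    loss = task_loss + censor_weight * adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return task_loss.item(), adv_loss.item()
```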

Furthermore, so that we don’t destroy any hope of decent task performance, we can erase out-of-domain information only to the degree that it exceeds what is already available from some known-safe “black box” model of the environment outside the domain.
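
One way that could be cashed out, again as a rough sketch rather than a worked-out proposal: train the adversary normally, but penalize the main network only for the amount by which the adversary beats a frozen, known-safe baseline at predicting out-of-domain labels. The hinge formulation and the names below are assumptions of mine.

```python
# Sketch of "relative" censoring: penalize out-of-domain information only to the
# extent that an adversary reading the AI's representation predicts out-of-domain
# labels better than a frozen, known-safe baseline model does.
import torch
import torch.nn.functional as F


def relative_censor_penalty(adv_logits, baseline_logits, ood_labels):
    """Positive only when the adversary outperforms the baseline on the
    out-of-domain prediction task; zero once the representation reveals no
    more than the baseline already knows."""
    adv_loss = F.cross_entropy(adv_logits, ood_labels)
    with torch.no_grad():  # the baseline is fixed; it just sets the allowed level
        baseline_loss = F.cross_entropy(baseline_logits, ood_labels)
    # Excess information ~ how much lower the adversary's loss is than the baseline's.
    return torch.relu(baseline_loss - adv_loss)


# The encoder would then minimize
#     task_loss + censor_weight * relative_censor_penalty(...)
# while the adversary is updated separately to minimize its own cross-entropy,
# so it remains a strong extractor of whatever out-of-domain info is left;
# the single-optimizer gradient-reversal shortcut from the previous sketch
# no longer applies cleanly here.
```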

Examples of tasks that we might want to use this on:

  • Play well on a suite of interactive fiction games, but without understanding human psychology any better than some baseline natural-language model (e.g. as tested by whether it can predict text conversations between humans).

  • Optimize the newsfeed algorithm for a social network, as measured by analytics data, again without learning human psychology beyond some simple baseline model.

  • Do engineering or chemistry within a physically enclosed box, without learning anything about the world outside that box beyond a simple model of it (e.g. as tested by asking it questions about external objects, including the AI’s own hardware, which you’ve put outside the box for obvious reasons).

Question: Is there an extremely simple toy example we could test out on current systems without human input?