Deconfusing Deception

What does deception look like from the outside? I notice I am confused.

Imagine you have some algorithm which learns to predict sensory inputs, like a human. Its internal structure will come to correspond to the external world in some way, as part of generating a predictive model. Imagine the algorithm is walking around a building, looking at the objects in it. Most of the objects are ordinary, made-of-atoms objects. The algorithm learns to predict where the objects are and how they interact, and basically gets very low input-prediction error.
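To make the setup concrete, here is a minimal toy sketch (my own construction, assuming a simple linear sensor map; none of the names below come from the scenario itself) of a learner whose internal state is nudged to reduce input-prediction error, so that the state comes to mirror whatever it is looking at:

```python
import numpy as np

class PredictiveLearner:
    """Toy predictive learner: its internal state is tuned to cut sensory prediction error."""

    def __init__(self, state_dim: int, obs_dim: int, lr: float = 0.02, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.lr = lr
        self.state = np.zeros(state_dim)                 # internal "model of the world" (illustrative)
        self.C = rng.normal(size=(obs_dim, state_dim))   # assumed fixed map from model to expected sensation

    def predict(self) -> np.ndarray:
        return self.C @ self.state                       # what it expects to sense next

    def update(self, observation: np.ndarray) -> float:
        error = observation - self.predict()             # input-prediction error
        self.state += self.lr * self.C.T @ error         # gradient step: adjust the model to fit the input
        return float(np.mean(error ** 2))

# Walking around the building: repeated exposure to the same scene lowers the
# prediction error, i.e. the internal state comes to track the scene.
scene = np.random.default_rng(1).normal(size=4)
learner = PredictiveLearner(state_dim=8, obs_dim=4)
for _ in range(500):
    mse = learner.update(scene)
```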

Then it comes across a teapot. The teapot isn’t a real object, but in fact a hallucination projected onto the sensory arrays of the algorithm by a daemon. At first, the daemon only has to project some visual data. But then the algorithm picks up the teapot, so the daemon must project some tactile data.

For a “successful” deception, the algorithm must not “notice” that its purely-physical model of the world is being violated.

When the algorithm decides to make a cup of tea, the daemon must not only fake the sensations of a full, brewing pot of tea, but also hide the sight of the water and teabag falling onto the floor. Later, the algorithm slips on the spilled water and falls over, so from then on the daemon must fabricate all the sensory data the algorithm receives.

Or, the daemon could fabricate some other reason for the sensory data. The water on the floor could have leaked in from a crack in the ceiling. For this to work, the algorithm must have already been uncertain about whether there was a crack in the ceiling or not, and the daemon must have known this.

This highlights two ways you can deceive a learning algorithm. Both involve control of all the information flowing between a system (in this case the whole world) and an algorithm. One requires modelling the system; the other requires modelling the algorithm.
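A minimal sketch of the two strategies, assuming a toy world of boolean propositions (the dict-based interfaces and names below are my own illustration, not a real mechanism):

```python
# Concrete toy of the two deception strategies (my construction, not the post's;
# all names are illustrative assumptions). The world and the observations are
# dicts of propositions, and the daemon sits on the channel between them.

REAL_WORLD = {"teapot_on_table": False, "water_on_floor": True}

def deceive_by_modelling_system(fake_facts, dynamics):
    """Strategy 1: simulate a counterfactual world that contains the fake object
    and serve observations from that simulation. Every downstream consequence
    (tea in the pot, dry floor) is fabricated consistently, which requires a
    model of the system's dynamics."""
    simulated = dict(REAL_WORLD, **fake_facts)
    return dynamics(simulated)

def deceive_by_modelling_algorithm(fake_facts, cover_story, prior):
    """Strategy 2: let the real consequences through, but attribute them to a
    cause the algorithm already assigns nonzero probability to. Requires a
    model of the algorithm's beliefs, not of the system."""
    if prior.get(cover_story, 0.0) == 0.0:
        raise RuntimeError(f"deception noticed: the algorithm was sure there is no {cover_story}")
    observation = dict(REAL_WORLD, **fake_facts)
    observation[cover_story] = True          # e.g. blame the wet floor on a ceiling crack
    return observation

# If the teapot were real, the water would have gone into it rather than onto the floor.
teapot_dynamics = lambda world: dict(world, water_on_floor=not world["teapot_on_table"])

print(deceive_by_modelling_system({"teapot_on_table": True}, teapot_dynamics))
print(deceive_by_modelling_algorithm({"teapot_on_table": True}, "ceiling_crack",
                                     prior={"ceiling_crack": 0.3}))
```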


How else can we look at this? It seems like if there’s a daemon making a fake teapot, the process moving information into the algorithm has a discontinuity around the part of the algorithm that holds its model of the teapot. The reason a real teapot leads to a model of the teapot in the algorithm follows a different-looking causal chain from the one by which a daemon leads to a model of the teapot in the algorithm.

This discontinuity defines a region which grows as the fake object interacts with the real ones. The region can be expanded to fill the whole of the system external to the algorithm, or contracted to zero. I hypothesize that these are the only two stable states of the region: every other boundary is broken by leaky abstractions.

The reason this feels like deception is that we also have a bit of self-knowledge which looks like “my model of the world is generated by the world itself, through my senses”. When we’re deceived, this is violated on a specific, local scale. Hence the discontinuity in the world-to-model map.
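As a sketch in notation I’m introducing here (an assumption on my part, not an established formalism), the self-knowledge and its local violation could be written as:

```latex
% Sketch in notation introduced here (my assumption, not the post's formalism).
% Let w be the state of the external system, m the algorithm's internal model,
% and f the perception process that produces the model from the world: m = f(w).
% The bit of self-knowledge is the belief that every part of m is sourced from
% w through f. Under deception, a local region of m is instead sourced from the
% daemon d through some other process g:
\[
  m \;=\;
  \begin{cases}
    f(w) & \text{outside the deceived region,} \\
    g(d) & \text{inside it,}
  \end{cases}
\]
% so the world-to-model map has a discontinuity at the boundary of that region.
```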

I’m not sure whether this discontinuity can be distinguished from other discontinuities in the world-to-model map. At first glance it seems different, but I notice I am confused.


If these properties hold for everything humans would consider deception, this is important. One problem with translating ontologies (which is importantly relevant for the Eliciting Latent Knowledge problem) is how to identify deception. If, for any learning algorithm, we can specify a single world-plus-algorithm state which has no deception, and specify which processes for changing that into other world-plus-algorithm states preserve the no-deception property, then we can specify a “natural ontology” for that learning algorithm.
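A rough way to write that closing claim down, again in notation I’m introducing rather than anything pinned down above:

```latex
% Sketch in my own notation (an assumption, not something the post specifies).
% Let (w, a) be a joint world-plus-algorithm state and Dec(w, a) the predicate
% "some deception is occurring". Suppose we can exhibit one deception-free state
% and a set T of processes that preserve the no-deception property:
\[
  \neg \mathrm{Dec}(w_0, a_0),
  \qquad
  \forall t \in T:\;\; \neg \mathrm{Dec}(w, a) \;\Rightarrow\; \neg \mathrm{Dec}\bigl(t(w, a)\bigr).
\]
% Then every state reachable from (w_0, a_0) via processes in T is deception-free,
% and the way the algorithm's internals track the world across this reachable set
% is the candidate "natural ontology" for that learning algorithm.
```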