mAIry’s room: AI reasoning to solve philosophical problems

This post grew out of a conversation with Laurent Orseau; we were initially going to write a paper for a consciousness/​philosophy journal of some sort, but that now seems unlikely, so I thought I’d post the key ideas here.

A summary of this post can be found here—it even has some diagrams.

The central idea is that, by thinking in terms of an AI or similar artificial agent, we can get some interesting solutions to old philosophical problems, such as the Mary’s room/knowledge problem. In essence, simple agents exhibit features similar to Mary’s in the thought experiment, so (most) explanations of Mary’s experience must also apply to simple artificial agents.

To summarise:

  • Artificial agents can treat certain inputs as if the input were different from mere information.

  • This analogises loosely to how humans “experience” certain things.

  • If the agent is a more limited (and more realistic) design, this analogy can get closer.

  • There is an artificial version of Mary, mAIry, which would plausibly have something similar to what Mary experiences within the thought experiment.

Edit: See also orthonormal’s sequence here.

Mary’s Room and the Knowledge problem

In this thought experiment, Mary has been confined to a grey room from birth, exploring the outside world only through a black-and-white monitor.

Though isolated, Mary is a brilliant scientist, and has learnt all there is to know about light, the eye, colour theory, human perception, and human psychology. It would seem that she has all possible knowledge that there could be about colour, despite having never seen it.

Then one day she gets out of her room, and says “wow, so that’s what purple looks like!”.

Has she learnt anything new here? If not, what is her exclamation about? If so, what is this knowledge—Mary was supposed to know everything there was to know about colour already?

Incidentally, I chose “purple” as the colour Mary would see, because the two colours most often used, red and blue, lead to confusion as to what “seeing red/blue” means: is this about the brain, or is it about the cones in the eye? But seeing purple is strictly about perception in the brain.

Example in practice

Interestingly, there are real examples of Mary’s room-like situations. Some people with red-green colour-blindness can suddenly start seeing new colours with the right glasses. Apparently this happens because the red and green cones in their eyes are almost identical, so they tend to always fire together. But “almost” is not “exactly”, and the glasses force green and red colours apart, so the red and green cones start firing separately, allowing the colour blind to see or distinguish new colours.

Can you feel my pain? The AI’s reward channel

This argument was initially presented here.


Let’s start with the least human AI we can imagine: AIXI, which is more an equation than an agent. Because we’ll be imagining multiple agents, let’s pick any computable version of AIXI, such as AIXItl.

There will be two such AIXItl agents, called $A_p$ and $A_q$, and they will share observations and rewards: at turn $i$, this will be $o_i$, $p_i$, and $q_i$, with $p_i$ the reward of $A_p$ and $q_i$ the reward of $A_q$.

To simplify, we’ll ignore the game theory between the agents; each agent will treat the other as part of the environment and attempt to maximise their reward around this constraint.

Then it’s clear that, even though $p_i$ and $q_i$ are both part of each agent’s observation, each agent will treat their own reward in a special way. Their actions are geared to increasing their own reward; $A_p$ might find $q_i$ informative, but has no use for it beyond that.

For example, $A_p$ might sacrifice current $p_i$ to get information that could lead it to increase later rewards $p_j$; it would never do so to increase $q_j$. It would sacrifice all $q$-rewards to increase the expected sum of $p_i$; indeed it would sacrifice its knowledge of $q_i$ entirely to increase that expected sum by the tiniest amount. And $A_q$ would be in the exact opposite situation.

The $A_p$ agent would also do other things, like sacrificing $p_i$ in counterfactual universes to increase $p_i$ in this one. It would also refuse the following trade: perfect knowledge of the ideal policy that would have maximised expected $p_i$, in exchange for the $p_i$ being set to $0$ from then on. In other words, it won’t trade $p_i$ for perfect information about $p_i$.
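As a rough illustration of the asymmetry described above, here is a minimal sketch of two agents that see forecasts for both reward channels but act only on their own. This is not AIXItl: the `Forecast` class, the `choose_action` helper, and the numbers are all hypothetical stand-ins for whatever planning machinery the real agent would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    """Hypothetical expected total reward on each channel for one action."""
    expected_p: float
    expected_q: float


def choose_action(forecasts, own_channel):
    """Return the action with the highest expected reward on the agent's own channel.

    The other channel is visible in the forecast, but never enters the objective.
    """
    def own_value(action):
        forecast = forecasts[action]
        return forecast.expected_p if own_channel == "p" else forecast.expected_q

    return max(forecasts, key=own_value)


# Illustrative numbers only: "grab" destroys all q-reward for a tiny gain in p.
forecasts = {
    "cooperate": Forecast(expected_p=1.000, expected_q=100.0),
    "grab": Forecast(expected_p=1.001, expected_q=0.0),
}

print("p-agent picks:", choose_action(forecasts, "p"))
print("q-agent picks:", choose_action(forecasts, "q"))
```

In this toy setting the agent rewarded on the p channel picks "grab" and the agent rewarded on the q channel picks "cooperate", even though both are looking at exactly the same forecasts.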

So what are these reward channels to these agents? It would go too far to call them qualia, but they do seem to have some features of pleasure/​pain in humans. We don’t feel the pleasure and pain of others in the same way we feel our own. We don’t feel counterfactual pain as we feel real pain; and we certainly wouldn’t agree to suffer maximal pain in exchange for knowing how we could have otherwise felt maximal pleasure. Pleasure and pain can motivate us to action in ways that few other things can: we don’t treat them as pure information.

Similarly, $A_p$ doesn’t treat $p_i$ purely as information either. To stretch the definition of a word, we might say that $A_p$ is experiencing $p_i$ in ways that it doesn’t experience $o_i$ or $q_i$.

Let’s try and move towards a more human-like agent.

TD-Lambda learning

TD stands for temporal difference learning: learning by the difference between a predicted reward and the actual reward. For the TD-Lambda algorithm, the agent uses $V(s)$: the estimated value of the state $s$. It then goes on its merry way, and as it observes histories of the form $s_1 r_1 s_2 r_2 \ldots s_i r_i$, it updates its estimate of all its past $V(s_{i-n})$ (with a discount factor of $\lambda^n$ for more distant past states $s_{i-n}$).
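For reference, one common textbook way of writing this update is the backward view with eligibility traces. The step size $\alpha$, the discount $\gamma$, the trace $e(s)$, and the indexing of rewards are assumptions of this sketch, not notation taken from the post:

```latex
\begin{align*}
\delta_i &= r_{i+1} + \gamma V(s_{i+1}) - V(s_i) \\
e(s) &\leftarrow \gamma \lambda \, e(s) + \mathbf{1}[s = s_i] \quad \text{for every state } s \\
V(s) &\leftarrow V(s) + \alpha \, \delta_i \, e(s) \quad \text{for every state } s
\end{align*}
```

A state last visited $n$ steps ago carries a trace of $(\gamma\lambda)^n$, which reduces to the $\lambda^n$ weighting described above when $\gamma = 1$. A single surprising reward (a large $\delta_i$) therefore changes every recently visited $V(s)$ at once.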

Again, imagine there are two agents, $T_p$ and $T_q$, with separate reward functions $p$ and $q$, and that each agent gets to see the other’s reward.

What happens when $T_p$ encounters an unexpectedly large or small value of $q_i$? Well, how would it interpret the $q_i$ in the first place? Maybe as part of the state-data $s_i$. In that case, an unexpected $q_i$ moves $T_p$ to a new, potentially unusual state $s_i$, rather than an expected $s'_i$. But this is only relevant if $V(s_i)$ is very different from $V(s'_i)$: in other words, unexpected $q_i$ are only relevant if they imply something about expected $p_i$. And even when they do, their immediate impact is rather small: a different state reached.

Compare what happens when $T_p$ encounters an unexpectedly large or small value of $p_i$. The impact of that is immediate: the information percolates backwards, updating all the $V(s_{i-n})$. There is an immediate change to the inner variables all across the agent’s brain.
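The contrast can be seen in a small tabular sketch. Everything here is an assumption made for illustration: the `TDLambdaAgent` class, the accumulating-trace variant of TD-Lambda, the parameter values, and the choice to encode the other agent’s reward as part of the state tuple.

```python
from collections import defaultdict


class TDLambdaAgent:
    """Tabular TD(lambda) with accumulating eligibility traces (illustrative only)."""

    def __init__(self, alpha=0.5, gamma=1.0, lam=0.8):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.V = defaultdict(float)      # value estimates, initially 0
        self.trace = defaultdict(float)  # eligibility of each visited state

    def step(self, state, own_reward, next_state):
        """Process one transition; return the value changes it caused."""
        delta = own_reward + self.gamma * self.V[next_state] - self.V[state]
        self.trace[state] += 1.0
        changed = {}
        for s in list(self.trace):
            update = self.alpha * delta * self.trace[s]
            if update != 0.0:
                self.V[s] += update
                changed[s] = update
            self.trace[s] *= self.gamma * self.lam  # older states fade
        return changed


def run(final_own_reward, final_other_reward):
    """Three uneventful steps, then one step with the given rewards.

    States are (position, other agent's reward as observed) tuples.
    """
    agent = TDLambdaAgent()
    for i in range(3):
        agent.step((i, 0.0), 0.0, (i + 1, 0.0))
    return agent.step((3, 0.0), final_own_reward, (4, final_other_reward))


surprising_own = run(final_own_reward=10.0, final_other_reward=0.0)
surprising_other = run(final_own_reward=0.0, final_other_reward=10.0)

print("value changes after surprising own reward:  ", surprising_own)
print("value changes after surprising other reward:", surprising_other)
```

With these settings, the surprising own reward changes the estimates of all four visited states (by about 5.0, 4.0, 3.2, and 2.56, fading with distance), while the surprising other reward changes none of them: it only puts the agent in a state it has no reason yet to value differently.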

In this case, the ‘experience’ of the agent encountering high/low $p_i$ resembles our own experience of extreme pleasure/pain: immediate involuntary re-wiring and change of estimates through a significant part of our brain.

We could even give $T_p$ a certain way of ‘knowing’ that high/low $p_i$ might be incoming; maybe there’s a reliability score for $V(s_i)$, or some way of tracking variance in the estimate. Then a low reliability or high variance score could indicate to $T_p$ that high/low $p_i$ might happen (maybe these could feed into the learning rate $\alpha$). But, even if the magnitude of the $p_i$ is not unexpected, it will still cause changes across all the previous estimates, even if these changes are in some sense expected.

mAIry in its room

So we’ve established that artificial agents can treat certain classes of inputs in a special way, “experiencing” their data (for lack of a better word) in a way that is different from simple information. And sometimes these inputs can strongly rewire the agent’s brain/​variable values.

Let’s now turn back to the initial thought experiment, and posit that we have a mAIry, an AI version of Mary, similarly brought up without the colour purple. mAIry stores knowledge as weights in a neural net, rather than connections of neurons, but otherwise the thought experiment is very similar.

mAIry knows everything about light, cameras, and how neural nets interpret concepts, including colour. It knows that, for example, “seeing purple” corresponds to a certain pattern of activation in the neural net. We’ll simplify, and just say that there’s a certain node $N_p$ such that, if its activation reaches a certain threshold, the net has “seen purple”. mAIry is aware of this fact, and can identify the $N_p$ node within itself, and perfectly predict the sequence of stimuli that could activate it.

If mAIry is still a learning agent, then seeing a new stimulus for the first time is likely to cause a lot of changes in the weights of its nodes; again, these are changes that mAIry can estimate and predict. Let $C_p$ be a Boolean corresponding to whether these changes have happened or not.

What dreams of purple may come...

A sufficiently smart mAIry might be able to force itself to “experience” seeing purple, without ever having seen it. If it has full self-modification powers, it could manually activate $N_p$ and cause the changes that result in $C_p$ being true. With more minor abilities, it could trigger some low-level neurons that caused a similar change in its neural net.

In terms of the human Mary, these would correspond to something like self-brain surgery and self-hypnosis (or maybe self-induced dreams of purple).

Coming out of the room: the conclusion

So now assume that mAIry exits the room for the first time, and sees something purple. It’s possible that mAIry has successfully self-modified to activate $N_p$ and set $C_p$ to true. In that case, upon seeing something purple, mAIry gets no extra information, no extra knowledge, and nothing happens in its brain that could correspond to a “wow”.

But what if mAIry has not been able to self-modify? Then upon seeing a purple flower, the node $N_p$ is strongly activated for the first time, and a whole series of weight changes flows across mAIry’s brain, making $C_p$ true.

That is the “wow” moment for mAIry. Both mAIry and Mary have experienced something; something they both perfectly predicted ahead of time, but something that neither could trigger ahead of time, nor prevent from happening when they did see something purple. The novel activation of $N_p$ and the changes labelled by $C_p$ were both predictable and unavoidable for a smart mAIry without self-modification abilities.
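The “predictable but not triggerable” structure can be made concrete with a toy model. This is only a sketch under invented assumptions: the crude `purple_activation` detector, the threshold, the learning rule, and the `MAIry` class are all hypothetical and far simpler than any real neural net.

```python
THRESHOLD = 0.25      # hypothetical firing threshold of the purple node
LEARNING_RATE = 0.5   # hypothetical size of the resulting weight changes


def purple_activation(rgb):
    """Crude stand-in for the purple node: red and blue present, green absent."""
    r, g, b = rgb
    return min(r, b) - g


def predicted_weights(weights, rgb):
    """What the weights WOULD become on seeing rgb; changes nothing."""
    activation = purple_activation(rgb)
    if activation < THRESHOLD:
        return list(weights)
    return [w + LEARNING_RATE * activation for w in weights]


class MAIry:
    """An agent with no self-modification: weights change only via see()."""

    def __init__(self):
        self.weights = [0.25, 0.5]
        self.changed = False  # the Boolean recording whether the rewiring happened

    def see(self, rgb):
        new_weights = predicted_weights(self.weights, rgb)
        if new_weights != self.weights:
            self.weights = new_weights
            self.changed = True


GREY = (0.5, 0.5, 0.5)
PURPLE = (0.5, 0.0, 1.0)

mairy = MAIry()
forecast = predicted_weights(mairy.weights, PURPLE)  # perfect foreknowledge

mairy.see(GREY)  # life inside the room
print("after grey:  ", mairy.weights, mairy.changed)

mairy.see(PURPLE)  # leaving the room
print("after purple:", mairy.weights, mairy.changed)
print("matches forecast:", mairy.weights == forecast)
```

Computing `forecast` leaves the weights and the flag untouched; only the actual stimulus rewires the agent, and it does so exactly as forecast.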

At this point the analogy I’m trying to draw should be clear: activating $N_p$, and the unavoidable changes in the weights that cause $C_p$ to be true, are similar to what a TD-Lambda agent goes through when encountering unexpectedly high or low rewards. They are a “mental experience”, unprecedented for the agent even if entirely predictable.

But they are not evidence for epiphenomenalism or against physicalism—unless we want to posit that mAIry is non-physical or epiphenomenal.

It is interesting, though, that this argument suggests that qualia are very real, and distinct from pure information, though still entirely physical.