mAIry’s room: AI reasoning to solve philosophical problems
This post grew out of a conversation with Laurent Orseau; we were initially going to write a paper for a consciousness/philosophy journal of some sort, but that now seems unlikely, so I thought I’d post the key ideas here.
A summary of this post can be found here—it even has some diagrams.
The central idea is that, by thinking in terms of AI or similar artificial agents, we can get some interesting solutions to old philosophical problems, such as the Mary’s room/knowledge problem. In essence, simple agents exhibit features similar to Mary’s in the thought experiment, so (most) explanations of Mary’s experience must also apply to simple artificial agents.
To summarise:
Artificial agents can treat certain inputs as if the input were different from mere information.
This analogises loosely to how humans “experience” certain things.
If the agent is of a more limited (and more realistic) design, the analogy becomes closer.
There is an artificial version of Mary, mAIry, which would plausibly have something similar to what Mary experiences within the thought experiment.
Edit: See also orthonormal’s sequence here.
Mary’s Room and the Knowledge problem
In this thought experiment, Mary has been confined to a grey room from birth, exploring the outside world only through a black-and-white monitor.
Though isolated, Mary is a brilliant scientist, and has learnt all there is to know about light, the eye, colour theory, human perception, and human psychology. It would seem that she has all possible knowledge that there could be about colour, despite having never seen it.
Then one day she gets out of her room, and says “wow, so that’s what purple looks like!”.
Has she learnt anything new here? If not, what is her exclamation about? If so, what is this knowledge—Mary was supposed to know everything there was to know about colour already?
Incidentally, I chose “purple” as the colour Mary would see, since the two colours most often used, red and blue, lead to confusion over what “seeing red/blue” means: is it about the brain, or about the cones in the eye? Seeing purple, by contrast, is strictly about perception in the brain.
Example in practice
Interestingly, there are real examples of Mary’s-room-like situations. Some people with red-green colour-blindness can suddenly start seeing new colours with the right glasses. Apparently this happens because the red and green cones in their eyes are almost identical, so they tend to always fire together. But “almost” is not “exactly”, and the glasses force green and red colours apart, so the red and green cones start firing separately, allowing the colour-blind to see or distinguish new colours.
Can you feel my pain? The AI’s reward channel
This argument was initially presented here.
AIXI
Let’s start with the least human-like AI we can imagine: AIXI, which is more an equation than an agent. Because we’ll be imagining multiple agents, let’s pick any computable version of AIXI, such as AIXItl.
There will be two such AIXItl agents, called A_r and A_q, and they will share observations and rewards: at turn i, these are o_i, r_i, and q_i, with r_i the reward of A_r and q_i the reward of A_q.
To simplify, we’ll ignore the game theory between the agents; each agent will treat the other as part of the environment and attempt to maximise their reward around this constraint.
Then it’s clear that, even though r_i and q_i are both part of each agent’s observation, each agent treats its own reward in a special way. Its actions are geared to increasing its own reward; A_r might find q_i informative, but has no use for it beyond that.
For example, A_r might sacrifice current r_i to gain information that could help it increase later rewards r_j (for j>i); it would never do so to increase q_j. It would sacrifice all q-rewards to increase the expected sum of r_i; indeed it would give up its knowledge of q_i entirely to increase that expected sum by the tiniest amount. And A_q would be in the exact opposite situation.
The A_r agent would also do other things, like sacrificing r_i in counterfactual universes to increase r_i in this one. It would also refuse the following trade: perfect knowledge of the ideal policy that would have maximised expected r_i, in exchange for r_i being set to 0 from then on. In other words, it won’t trade r_i for perfect information about r_i.
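To make the asymmetry concrete, here is a minimal sketch: an agent planning over a toy two-action world maximises only its own reward channel, however large the payoffs on the other channel become. The environment, the payoff numbers, and the function name are invented for illustration; this is not AIXItl, just the reward-channel asymmetry it exhibits.

```python
import itertools

# A toy two-reward world: each action yields (r, q), the rewards of
# agents A_r and A_q respectively. Both rewards are fully observed by
# both agents, but each agent plans only over the sum of its own channel.
# (Hypothetical payoffs, chosen for illustration.)
ACTIONS = {
    "a": (1.0, 0.0),   # good for A_r, worthless for A_q
    "b": (0.0, 10.0),  # worthless for A_r, great for A_q
}

def best_policy(horizon, reward_index):
    """Pick the action sequence maximising the chosen reward channel.

    reward_index 0 is the r channel, 1 is the q channel. The other
    channel is still observed, but plays no role in the choice.
    """
    return max(
        itertools.product(ACTIONS, repeat=horizon),
        key=lambda seq: sum(ACTIONS[a][reward_index] for a in seq),
    )

print(best_policy(3, 0))  # ('a', 'a', 'a'): A_r ignores the large q-payoffs
print(best_policy(3, 1))  # ('b', 'b', 'b'): A_q is in the mirror situation
```

However much we inflate the q-payoff of action "b", the r-maximiser’s choice never changes: q enters its world only as information.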
So what are these reward channels to these agents? It would go too far to call them qualia, but they do seem to have some features of pleasure/pain in humans. We don’t feel the pleasure and pain of others in the same way we feel our own. We don’t feel counterfactual pain as we feel real pain; and we certainly wouldn’t agree to suffer maximal pain in exchange for knowing how we could have otherwise felt maximal pleasure. Pleasure and pain can motivate us to action in ways that few other things can: we don’t treat them as pure information.
Similarly, A_r doesn’t treat r_i purely as information either. Stretching the definition of a word, we might say that A_r is experiencing r_i in ways that it doesn’t experience q_i or o_i.
Let’s try and move towards a more human-like agent.
TD-Lambda learning
TD stands for temporal difference learning: learning from the difference between a predicted reward and the actual reward. For the TD-Lambda algorithm, the agent uses V(s): the estimated value of the state s. It then goes on its merry way, and as it observes histories of the form … s_{i−1} a_{i−1} r_{i−1} s_i a_i r_i s_{i+1} a_{i+1} r_{i+1} …, it updates its estimates of all the past V(s_j) (with a decay factor 0≤λ≤1 shrinking the updates for more distant past states s_j, j<i).
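The update rule above can be sketched as follows: a minimal tabular TD(λ) with accumulating eligibility traces. The state names and constants are illustrative.

```python
from collections import defaultdict

def td_lambda_episode(transitions, V=None, alpha=0.1, gamma=0.9, lam=0.8):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    transitions: list of (state, reward, next_state) triples.
    Each TD error delta = r + gamma*V(next) - V(s) is propagated
    backwards to all previously visited states, weighted by their
    exponentially decaying eligibility trace.
    """
    V = defaultdict(float) if V is None else V
    e = defaultdict(float)            # eligibility traces
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]
        e[s] += 1.0                   # mark the current state as eligible
        for state in list(e):
            V[state] += alpha * delta * e[state]  # surprise percolates backwards
            e[state] *= gamma * lam               # traces decay with distance

    return V

# A surprising reward at the end of the episode updates *every* earlier
# state's value, not just the last one:
V = td_lambda_episode([("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 10.0, "end")])
print(V["s0"], V["s1"], V["s2"])  # all three are now positive
```

The inner loop is the point of the example: a single surprising r rewrites estimates across the whole traced history at once.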
Again, imagine there are two agents, T_r and T_q, with separate reward functions r and q, and that each agent gets to see the other’s reward.
What happens when T_r encounters an unexpectedly large or small value of q_i? Well, how would it interpret q_i in the first place? Maybe as part of the state-data s_{i+1}. In that case, an unexpected q_i moves T_r to a new, potentially unusual state s_{i+1}, rather than the expected s′_{i+1}. But this is only relevant if V(s_{i+1}) is very different from V(s′_{i+1}): in other words, unexpected q_i are only relevant if they imply something about expected r_i. And even when they do, their immediate impact is rather small: a different state is reached.
Compare what happens when T_r encounters an unexpectedly large or small value of r_i. The impact is immediate: the information percolates backwards, updating all the V(s_j) for j<i. There is an immediate change to the inner variables all across the agent’s brain.
In this case, the ‘experience’ of the T_r agent encountering high/low r_i resembles our own experience of extreme pleasure/pain: immediate, involuntary re-wiring and change of estimates through a significant part of the brain.
We could even give T_r a certain way of ‘knowing’ that high/low r_i might be incoming; maybe there’s a reliability score for V(s_i), or some way of tracking the variance of the estimate. Then a low reliability or high variance score could indicate to T_r that a high/low r_i might happen (maybe these could feed into the learning rate α). But, even if the magnitude of r_i is not unexpected, it will still cause changes across all the previous estimates, even if those changes are in some sense expected.
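The reliability/variance idea can be sketched as follows. The class name, the moving-average constants, and the choice to shrink α as variance grows are all illustrative assumptions (the opposite choice of direction would be just as defensible); the point is that the update itself still always happens.

```python
from collections import defaultdict

class VarianceAwareTD:
    """Tabular TD learner that also tracks the variance of its TD errors.

    High-variance states are treated as 'low reliability': here that
    shrinks the effective learning rate. All constants are illustrative.
    """
    def __init__(self, alpha=0.2, gamma=0.9):
        self.V = defaultdict(float)
        self.var = defaultdict(lambda: 1.0)   # running TD-error variance
        self.alpha, self.gamma = alpha, gamma

    def step(self, s, r, s_next):
        delta = r + self.gamma * self.V[s_next] - self.V[s]
        # exponential moving average of the squared TD error
        self.var[s] = 0.9 * self.var[s] + 0.1 * delta ** 2
        # low reliability (high variance) -> smaller effective step size,
        # but note that the update itself still happens
        effective_alpha = self.alpha / (1.0 + self.var[s])
        self.V[s] += effective_alpha * delta
        return delta

agent = VarianceAwareTD()
# Repeated 'surprising' rewards in the same state shrink later updates,
# without ever reducing them to zero:
v0 = agent.V["s"]; agent.step("s", 10.0, "end"); inc1 = agent.V["s"] - v0
v1 = agent.V["s"]; agent.step("s", 10.0, "end"); inc2 = agent.V["s"] - v1
print(inc1 > inc2 > 0)  # True
```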
mAIry in its room
So we’ve established that artificial agents can treat certain classes of inputs in a special way, “experiencing” their data (for lack of a better word) in a way that is different from simple information. And sometimes these inputs can strongly rewire the agent’s brain/variable values.
Let’s now turn back to the initial thought experiment, and posit that we have mAIry, an AI version of Mary, similarly brought up without the colour purple. mAIry stores knowledge as weights in a neural net, rather than connections of neurons, but otherwise the thought experiment is very similar.
mAIry knows everything about light, cameras, and how neural nets interpret concepts, including colour. It knows that, for example, “seeing purple” corresponds to a certain pattern of activation in the neural net. We’ll simplify, and just say that there’s a certain node n_p such that, if its activation reaches a certain threshold, the net has “seen purple”. mAIry is aware of this fact, can identify the n_p node within itself, and can perfectly predict the sequence of stimuli that would activate it.
If mAIry is still a learning agent, then seeing a new stimulus for the first time is likely to cause a lot of changes in the weights of its nodes; again, these are changes that mAIry can estimate and predict. Let c_p be a Boolean recording whether these changes have happened or not.
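A deliberately oversimplified sketch of this setup: a single linear “purple detector” node n_p over an RGB input, with a flag c_p that flips when the threshold is first crossed. The names n_p and c_p come from the thought experiment; the weight values, threshold, and Hebbian-style update are invented for illustration.

```python
import numpy as np

# Toy mAIry: one linear node n_p over (R, G, B) input.
# Weight values and threshold are hypothetical.
w = np.array([0.5, -0.2, 0.9])   # weights into node n_p
THRESHOLD = 1.0
c_p = False                      # have the purple-triggered weight changes happened?

def see(stimulus, lr=0.1):
    """Process one stimulus. If n_p crosses threshold for the first time,
    Hebbian-style weight changes occur and c_p flips to True:
    predictably, and unavoidably, for this simple agent."""
    global w, c_p
    n_p = float(stimulus @ w)
    if n_p > THRESHOLD and not c_p:
        w = w + lr * stimulus    # the weight changes labelled by c_p
        c_p = True
    return n_p

grey   = np.array([0.5, 0.5, 0.5])
purple = np.array([1.0, 0.0, 1.0])   # red + blue

see(grey)     # n_p = 0.6: below threshold
print(c_p)    # False
see(purple)   # n_p = 1.4: threshold crossed, weights change
print(c_p)    # True
```

Note that mAIry can compute everything this code will do in advance, exactly as the text describes; what it cannot do (without self-modification) is make c_p true, or stop it becoming true, by prediction alone.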
What dreams of purple may come...
A sufficiently smart mAIry might be able to force itself to “experience” seeing purple without ever having seen it. If it has full self-modification powers, it could manually activate n_p and cause the changes that result in c_p being true. With more limited abilities, it could trigger some low-level neurons to cause a similar change in its neural net.
In terms of the human Mary, these would correspond to something like self-brain surgery and self-hypnosis (or maybe self-induced dreams of purple).
Coming out of the room: the conclusion
So now assume that mAIry exits the room for the first time and sees something purple. It’s possible that mAIry has successfully self-modified to activate n_p and set c_p to true. In that case, upon seeing something purple, mAIry gets no extra information, no extra knowledge, and nothing happens in its brain that could correspond to a “wow”.
But what if mAIry has not been able to self-modify? Then upon seeing a purple flower, the node n_p is strongly activated for the first time, and a whole series of weight changes flows across mAIry’s brain, making c_p true.
That is the “wow” moment for mAIry. Both mAIry and Mary have experienced something: something they both perfectly predicted ahead of time, but something that neither could trigger ahead of time, nor prevent from happening when they did see something purple. The novel activation of n_p and the changes labelled by c_p were both predictable and unavoidable for a smart mAIry without self-modification abilities.
At this point the analogy I’m trying to draw should be clear: activating n_p, and the unavoidable changes in the weights that cause c_p to become true, are similar to what a TD-Lambda agent goes through when encountering unexpectedly high or low rewards. They are a “mental experience”, unprecedented for the agent even if entirely predictable.
But they are not evidence for epiphenomenalism or against physicalism—unless we want to posit that mAIry is non-physical or epiphenomenal.
It is interesting, though, that this argument suggests that qualia are very real, and distinct from pure information, though still entirely physical.