mAIry’s room: AI reasoning to solve philosophical problems

This post grew out of a con­ver­sa­tion with Lau­rent Orseau; we were ini­tially go­ing to write a pa­per for a con­scious­ness/​philos­o­phy jour­nal of some sort, but that now seems un­likely, so I thought I’d post the key ideas here.

Edit: See also or­thonor­mal’s se­quence here.

The central idea is that, by thinking in terms of an AI or a similar artificial agent, we can get some interesting solutions to old philosophical problems, such as the Mary’s room/knowledge problem. In essence, simple agents exhibit similar features to Mary in the thought experiment, so (most) explanations of Mary’s experience must also apply to simple artificial agents.

To sum­marise:

  • Ar­tifi­cial agents can treat cer­tain in­puts as if the in­put were differ­ent from mere in­for­ma­tion.

  • This anal­o­gises loosely to how hu­mans “ex­pe­rience” cer­tain things.

  • If the agent is a more limited (and more re­al­is­tic) de­sign, this anal­ogy can get closer.

  • There is an ar­tifi­cial ver­sion of Mary, mAIry, which would plau­si­bly have some­thing similar to what Mary ex­pe­riences within the thought ex­per­i­ment.

Mary’s Room and the Knowl­edge problem

In this thought ex­per­i­ment, Mary has been con­fined to a grey room from birth, ex­plor­ing the out­side world only through a black-and-white mon­i­tor.

Though iso­lated, Mary is a brilli­ant sci­en­tist, and has learnt all there is to know about light, the eye, colour the­ory, hu­man per­cep­tion, and hu­man psy­chol­ogy. It would seem that she has all pos­si­ble knowl­edge that there could be about colour, de­spite hav­ing never seen it.

Then one day she gets out of her room, and says “wow, so that’s what pur­ple looks like!”.

Has she learnt any­thing new here? If not, what is her ex­cla­ma­tion about? If so, what is this knowl­edge—Mary was sup­posed to know ev­ery­thing there was to know about colour already?

Incidentally, I chose “purple” as the colour Mary would see, because the two colours most often used, red and blue, lead to confusion about what “seeing red/blue” means: is this about the brain, or is it about the cones in the eye? Seeing purple, by contrast, is strictly about perception in the brain.

Ex­am­ple in practice

Interestingly, there are real examples of Mary’s room-like situations. Some people with red-green colour-blindness can suddenly start seeing new colours with the right glasses. Apparently this happens because the red and green cones in their eyes are almost identical, so they tend to always fire together. But “almost” is not “exactly”, and the glasses force green and red colours apart, so the red and green cones start firing separately, allowing the colour-blind to see or distinguish new colours.

Can you feel my pain? The AI’s re­ward channel

This ar­gu­ment was ini­tially pre­sented here.


Let’s start with the least hu­man AI we can imag­ine: AIXI, which is more an equa­tion than an agent. Be­cause we’ll be imag­in­ing mul­ti­ple agents, let’s pick any com­putable ver­sion of AIXI, such as AIXItl.

There will be two such AIXItl’s, called $A_x$ and $A_y$, and they will share observations and rewards: at turn $i$, this will be $o_i$, $r^x_i$, and $r^y_i$, with $r^x_i$ the reward of $A_x$ and $r^y_i$ the reward of $A_y$.

To sim­plify, we’ll ig­nore the game the­ory be­tween the agents; each agent will treat the other as part of the en­vi­ron­ment and at­tempt to max­imise their re­ward around this con­straint.

Then it’s clear that, even though $r^x_i$ and $r^y_i$ are both part of each agent’s observation, each agent will treat their own reward in a special way. Their actions are geared to increasing their own reward; $A_x$ might find $r^y_i$ informative, but has no use for it beyond that.

For example, $A_x$ might sacrifice current $r^x_i$ to get information that could lead it to increase later $r^x_j$; it would never do so to increase $r^y_j$. It would sacrifice all $r^y$-rewards to increase the expected sum of $r^x$; indeed it would sacrifice its knowledge of $r^y$ entirely to increase that expected sum by the tiniest amount. And $A_y$ would be in the exact opposite situation.

The $A_x$ agent would also do other things, like sacrificing $r^x$ in counterfactual universes to increase $r^x$ in this one. It would also refuse the following trade: perfect knowledge of the ideal policy that would have maximised expected $r^x$, in exchange for $r^x_i$ being set to the minimum reward from then on. In other words, it won’t trade $r^x$ for perfect information about $r^x$.
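The asymmetry can be written down schematically. This is a sketch under stated assumptions, not the full AIXItl definition: write $A_x$ for one agent, $r^x_j$ for its own reward, $r^y_j$ for the other agent’s reward, $h_i$ for the shared history up to turn $i$, $\pi$ for the policy followed after the current action, and $m$ for an assumed finite horizon.

$$a_i = \arg\max_{a} \max_{\pi} \; \mathbb{E}\left[\sum_{j=i+1}^{m} r^x_j \,\middle|\, h_i,\, a,\, \pi\right], \qquad h_i = (o_1, r^x_1, r^y_1, a_1, \dots, a_{i-1}, o_i, r^x_i, r^y_i)$$

The other agent’s reward $r^y$ appears only inside the conditioning history $h_i$, where it can shift the expectation, while $r^x$ is the quantity being summed and maximised. That is the sense in which one reward is “mere information” and the other is not.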

So what are these re­ward chan­nels to these agents? It would go too far to call them qualia, but they do seem to have some fea­tures of plea­sure/​pain in hu­mans. We don’t feel the plea­sure and pain of oth­ers in the same way we feel our own. We don’t feel coun­ter­fac­tual pain as we feel real pain; and we cer­tainly wouldn’t agree to suffer max­i­mal pain in ex­change for know­ing how we could have oth­er­wise felt max­i­mal plea­sure. Plea­sure and pain can mo­ti­vate us to ac­tion in ways that few other things can: we don’t treat them as pure in­for­ma­tion.

Similarly, $A_x$ doesn’t treat $r^x$ purely as information either. To stretch the definition of a word, we might say that $A_x$ is experiencing $r^x$ in ways that it doesn’t experience $o$ or $r^y$.

Let’s try and move to­wards a more hu­man-like agent.

TD-Lambda learning

TD stands for temporal difference learning: learning by the difference between a predicted reward and the actual reward. For the TD-Lambda algorithm, the agent uses $V(s)$: the estimated value of the state $s$. It then goes on its merry way, and as it observes histories of the form $s_{i-1} a_{i-1} r_i s_i$, it updates its estimate of all its past $V(s_{i-n})$ (with a discount factor of $\lambda^n$ for more distant past states $s_{i-n}$).

Again, imagine there are two agents, $T_x$ and $T_y$, with separate reward functions $r^x$ and $r^y$, and that each agent gets to see the other’s reward.

What happens when $T_x$ encounters an unexpectedly large or small value of $r^y_i$? Well, how would it interpret the $r^y_i$ in the first place? Maybe as part of the state-data $s_i$. In that case, an unexpected $r^y_i$ moves $T_x$ to a new, potentially unusual state $s'_i$, rather than an expected $s_i$. But this is only relevant if $V(s'_i)$ is very different from $V(s_i)$: in other words, unexpected $r^y$ are only relevant if they imply something about expected $r^x$. And even when they do, their immediate impact is rather small: a different state reached.

Compare what happens when $T_x$ encounters an unexpectedly large or small value of $r^x_i$. The impact of that is immediate: the information percolates backwards, updating all the $V(s_{i-n})$. There is an immediate change to the inner variables all across the agent’s brain.
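The contrast between the two kinds of surprise can be sketched in a few lines. This is a minimal illustration, assuming the standard backward-view TD(λ) with accumulating eligibility traces; the function names, the chain of states, and the choice to fold the other agent’s reward into the state label are all hypothetical, chosen only to make the point concrete.

```python
from collections import defaultdict


def td_lambda_step(V, traces, s_prev, r_own, s_next,
                   alpha=0.5, gamma=1.0, lam=0.8):
    """One backward-view TD(lambda) update driven by the agent's OWN reward."""
    # TD error: actual reward plus next estimate, minus the previous estimate.
    delta = r_own + gamma * V[s_next] - V[s_prev]
    traces[s_prev] += 1.0  # the state just left is fully eligible
    for s in list(traces):
        V[s] += alpha * delta * traces[s]  # error percolates to past states
        traces[s] *= gamma * lam           # older states decay geometrically
    return delta


def run_episode(own_rewards, other_rewards):
    """Walk a chain of states; the other agent's reward only labels the state."""
    V = defaultdict(float)
    traces = defaultdict(float)
    state = (0, 0.0)  # (position, last observed reward of the other agent)
    for i, (r_own, r_other) in enumerate(zip(own_rewards, other_rewards), 1):
        next_state = (i, r_other)
        td_lambda_step(V, traces, state, r_own, next_state)
        state = next_state
    return dict(V), state


# Surprise in the agent's own reward at the last step.
own_surprise_V, _ = run_episode([0, 0, 0, 10], [0, 0, 0, 0])
# Same-sized surprise, but in the other agent's reward.
other_surprise_V, final_state = run_episode([0, 0, 0, 0], [0, 0, 0, 10])

changed_by_own = sum(1 for v in own_surprise_V.values() if v != 0.0)
changed_by_other = sum(1 for v in other_surprise_V.values() if v != 0.0)
print("estimates changed by own-reward surprise:", changed_by_own)
print("estimates changed by other-reward surprise:", changed_by_other)
print("state reached after other-reward surprise:", final_state)
```

In this toy setup the own-reward surprise rewrites the value estimate of every state visited so far, while the other-reward surprise changes no estimate at all and merely lands the agent in a differently labelled state.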

In this case, the ‘experience’ of the $T_x$ agent encountering high/low $r^x_i$ resembles our own experience of extreme pleasure/pain: immediate involuntary re-wiring and change of estimates through a significant part of our brain.

We could even give $T_x$ a certain way of ‘knowing’ that high/low $r^x_i$ might be incoming; maybe there’s a reliability score for $V(s_i)$, or some way of tracking variance in the estimate. Then a low reliability or high variance score could indicate to $T_x$ that high/low $r^x_i$ might happen (maybe these could feed into the learning rate $\alpha$). But, even if the magnitude of the $r^x_i$ is not unexpected, it will still cause changes across all the previous estimates $V(s_{i-n})$, even if these changes are in some sense expected.

mAIry in its room

So we’ve es­tab­lished that ar­tifi­cial agents can treat cer­tain classes of in­puts in a spe­cial way, “ex­pe­rienc­ing” their data (for lack of a bet­ter word) in a way that is differ­ent from sim­ple in­for­ma­tion. And some­times these in­puts can strongly rewire the agent’s brain/​vari­able val­ues.

Let’s now turn back to the ini­tial thought ex­per­i­ment, and posit that we have a mAIry, an AI ver­sion of Mary, similarly brought up with­out the colour pur­ple. mAIry stores knowl­edge as weights in a neu­ral net, rather than con­nec­tions of neu­rons, but oth­er­wise the thought ex­per­i­ment is very similar.

mAIry knows everything about light, cameras, and how neural nets interpret concepts, including colour. It knows that, for example, “seeing purple” corresponds to a certain pattern of activation in the neural net. We’ll simplify, and just say that there’s a certain node $N_p$ such that, if its activation reaches a certain threshold, the net has “seen purple”. mAIry is aware of this fact, and can identify the node $N_p$ within itself, and perfectly predict the sequence of stimuli that could activate it.

If mAIry is still a learning agent, then seeing a new stimulus for the first time is likely to cause a lot of changes in the weights in its nodes; again, these are changes that mAIry can estimate and predict. Let $C_p$ be a Boolean corresponding to whether these changes have happened or not.

What dreams of pur­ple may come...

A sufficiently smart mAIry might be able to force itself to “experience” seeing purple, without ever having seen it. If it has full self-modification powers, it could manually activate $N_p$ and cause the changes that result in $C_p$ being true. With more minor abilities, it could trigger some low-level neurons that caused a similar change in its neural net.

In terms of the hu­man Mary, these would cor­re­spond to some­thing like self-brain surgery and self-hyp­no­sis (or maybe self-in­duced dreams of pur­ple).

Com­ing out of the room: the conclusion

So now assume that mAIry exits the room for the first time, and sees something purple. It’s possible that mAIry has successfully self-modified to activate $N_p$ and set $C_p$ to true. In that case, upon seeing something purple, mAIry gets no extra information, no extra knowledge, and nothing happens in its brain that could correspond to a “wow”.

But what if mAIry has not been able to self-modify? Then upon seeing a purple flower, the node $N_p$ is strongly activated for the first time, and a whole series of weight changes flow across mAIry’s brain, making $C_p$ true.

That is the “wow” moment for mAIry. Both mAIry and Mary have experienced something; something they both perfectly predicted ahead of time, but something that neither could trigger ahead of time, nor prevent from happening when they did see something purple. The novel activation of $N_p$ and the changes labelled by $C_p$ were both predictable and unavoidable for a smart mAIry without self-modification abilities.
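The gap between predicting the change and undergoing it can be made concrete with a toy learner. This is only a sketch under loud assumptions: the class `MAIry`, the three-channel stimulus encoding, the threshold, and the Hebbian-style weight update are all hypothetical stand-ins, not a model of any real network. The field `purple_node` plays the role of the purple-detecting node (called $N_p$ here) and `changed` plays the role of the Boolean for the resulting weight changes (called $C_p$ here).

```python
from dataclasses import dataclass, field


@dataclass
class MAIry:
    """Toy learner without self-modification powers (illustrative only)."""
    weights: list = field(default_factory=lambda: [0.2, -0.1, 0.4])
    purple_node: float = 0.0   # activation of the purple node
    changed: bool = False      # have the purple-induced weight changes happened?
    threshold: float = 0.5
    learning_rate: float = 0.1

    def _activation(self, stimulus):
        return sum(w * x for w, x in zip(self.weights, stimulus))

    def predict_update(self, stimulus):
        """Compute what seeing `stimulus` WOULD do, without applying it."""
        act = self._activation(stimulus)
        if act < self.threshold:
            return act, list(self.weights)
        new_weights = [w + self.learning_rate * act * x
                       for w, x in zip(self.weights, stimulus)]
        return act, new_weights

    def see(self, stimulus):
        """Actually receive the stimulus: the node fires and weights change."""
        act, new_weights = self.predict_update(stimulus)
        self.purple_node = act
        if act >= self.threshold:
            self.weights = new_weights
            self.changed = True


PURPLE = [1.0, 0.0, 1.0]  # hypothetical encoding: red and blue channels on

mairy = MAIry()
weights_before = list(mairy.weights)

# Inside the room: a perfect prediction, yet nothing in the agent has moved.
predicted_act, predicted_weights = mairy.predict_update(PURPLE)
untouched_after_prediction = (mairy.weights == weights_before
                              and not mairy.changed)

# Outside the room: the predicted changes now actually happen.
mairy.see(PURPLE)
prediction_was_exact = (mairy.weights == predicted_weights
                        and mairy.purple_node == predicted_act)
print("untouched after prediction:", untouched_after_prediction)
print("prediction was exact:", prediction_was_exact)
print("weight changes happened:", mairy.changed)
```

The design choice to have `see` reuse `predict_update` is deliberate: the agent’s forecast and its actual update are the same computation, so the forecast is perfect, but only `see` writes the result into the agent’s own state.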

At this point the analogy I’m trying to draw should be clear: activating $N_p$, and the unavoidable changes in the weights that cause $C_p$ to be true, are similar to what a TD-Lambda agent goes through when encountering unexpectedly high or low rewards. They are a “mental experience”, unprecedented for the agent even if entirely predictable.

But they are not ev­i­dence for epiphe­nom­e­nal­ism or against phys­i­cal­ism—un­less we want to posit that mAIry is non-phys­i­cal or epiphe­nom­e­nal.

It is in­ter­est­ing, though, that this ar­gu­ment sug­gests that qualia are very real, and dis­tinct from pure in­for­ma­tion, though still en­tirely phys­i­cal.