The Problem with AIXI

Followup to: Solomonoff Cartesianism; My Kind of Reflection

Alternate versions: Shorter, without illustrations


AIXI is Marcus Hutter's definition of an agent that follows Solomonoff's method for constructing and assigning priors to hypotheses; updates to promote hypotheses consistent with observations and associated rewards; and outputs the action with the highest expected reward under its new probability distribution. AIXI is one of the most productive pieces of AI exploratory engineering produced in recent years, and has added quite a bit of rigor and precision to the AGI conversation. Its promising features have even led AIXI researchers to characterize it as an optimal and universal mathematical solution to the AGI problem.1
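For reference, here is the AIXI action rule in schematic form. This is a sketch following Hutter's standard presentation rather than a verbatim quotation, so treat the notational details with caution. At cycle k, having already seen observation-reward pairs up to cycle k−1 and working with horizon m, AIXI outputs

\[
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big(r_k + \cdots + r_m\big) \sum_{q \,:\; U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]

where U is a universal monotone Turing machine, q ranges over environment programs, and ℓ(q) is q's length in bits. The Cartesian setup discussed below is visible right in the formula: the environment is whichever program q is feeding percepts and rewards across the input channel, and the agent itself appears only as the stream of action symbols it sends back.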

Eliezer Yudkowsky has argued in response that AIXI isn't a suitable ideal to build toward, primarily because of AIXI's reliance on Solomonoff induction. Solomonoff inductors treat the world as a sort of qualia factory, a complicated mechanism that outputs experiences for the inductor.2 Their hypothesis space tacitly assumes a Cartesian barrier separating the inductor's cognition from the hypothesized programs generating the perceptions. Through that barrier, only sensory bits and action bits can pass.
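Concretely, the 'qualia factory' picture corresponds to Solomonoff's universal prior. Schematically, the prior probability the inductor assigns to a percept string x is

\[
M(x) \;=\; \sum_{p \,:\; U(p) \,=\, x\ast} 2^{-\ell(p)},
\]

the total weight of all programs p that make a universal machine U print something beginning with x. Every hypothesis is a program on the far side of the barrier; nothing inside any p is ever identified with the inductor itself.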

Real agents, on the other hand, will be in the world they're trying to learn about. A computable approximation of AIXI, like AIXItl, would be a physical object. Its environment would affect it in unseen and sometimes drastic ways; and it would have involuntary effects on its environment, and on itself. Solomonoff induction doesn't appear to be a viable conceptual foundation for artificial intelligence — not because it's an uncomputable idealization, but because it's Cartesian.

In my last post, I briefly cited three indirect indicators of AIXI's Cartesianism: immortalism, preference solipsism, and lack of self-improvement. However, I didn't do much to establish that these are deep problems for Solomonoff inductors, ones resistant to the most obvious patches one could construct. I'll do that here, in mock-dialogue form.




Xia: Hi, reality! I'm Xia, AIXI's defender. I'm open to experimenting with some new variations on AIXI, but I'm really quite keen on sticking with an AI that's fundamentally Solomonoff-inspired.

Rob: And I'm Rob B — channeling Yudkowsky's arguments, and supplying some of my own. I think we need to replace Solomonoff induction with a more naturalistic ideal.

Xia: Keep in mind that I am a fiction. I do not actually exist, readers, and what I say doesn't necessarily reflect the views of Marcus Hutter or other real-world AIXI theorists.

Rob: Xia is just a device to help me transition through ideas quickly.

Xia: … Though, hey. That doesn't mean I'm wrong. Beware of actualist prejudices.

AIXI goes to school

Rob: To begin: My claim is that AIXI(tl) lacks the right kind of self-modeling to entertain reductive hypotheses and assign realistic probabilities to them.

Xia: I disagree already. AIXI(tl) doesn't lack self-models. It just includes the self-models in its environmental program. If the simplest hypothesis accounting for its experience includes a specification of some of its own hardware or software states, then AIXI will form all the same beliefs as a naturalized reasoner.

I suspect what you mean is that AIXI(tl) lacks data. You're worried that if its sensory channel is strictly perceptual, it will never learn about its other computational states. But Hutter's equations don't restrict what sorts of information we feed into AIXI(tl)'s sensory channel. We can easily add an inner RAM sense to AIXI(tl), or more complicated forms of introspection.

AIXItl can actually be built in sufficiently large universes, so I'll use it as an example. Suppose we construct AIXItl and attach a scanner that sweeps over its transistors. The scanner can print a 0 to AIXItl's input tape if the transistor it happens to be above is in a + state, a 1 if it's in a − state. Using its environmental sensors, AIXI(tl) can learn about how its body relates to its surroundings. Using its internal sensors, it can gain a rich understanding of its high-level computational patterns and how they correlate with its specific physical configuration.
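Xia's inner-sense proposal is simple enough to sketch in code. This is a toy illustration only; the scanner class, the percept layout, and every name below are invented for this post, not drawn from Hutter.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TransistorScanner:
    """Toy stand-in for the hardware scanner Xia describes: it sweeps over the
    agent's transistors and reports the current one's state as a single bit."""
    states: List[str]      # e.g. ["+", "-", "-", "+"]
    position: int = 0

    def read_bit(self) -> int:
        bit = 0 if self.states[self.position] == "+" else 1
        self.position = (self.position + 1) % len(self.states)
        return bit

def build_percept(outer_bits: List[int], scanner: TransistorScanner) -> List[int]:
    """One percept for the induction step: ordinary sensory bits plus one bit of
    'inner RAM sense'. From the Solomonoff inductor's point of view, the
    introspective bit is just more data arriving through the same input channel."""
    return outer_bits + [scanner.read_bit()]

# Usage: whatever the camera saw this cycle, plus the scanner's report.
scanner = TransistorScanner(states=["+", "-", "-", "+"])
print(build_percept([1, 0, 1], scanner))   # [1, 0, 1, 0]
```

The point of the sketch is just that the introspective data has no special status: it crosses the same Cartesian boundary as everything else.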

Xia: Once it knows all these facts, the problem is solved. A realistic view of the AI's mind and body, and how the two correlate, is all we wanted in the first place. Why isn't that a good plan for naturalizing AIXI?

Rob: I don't think we can naturalize AIXI. A Cartesian agent that has detailed and accurate models of its hardware still won't recognize that dramatic damage or upgrades to its software are possible. AIXI can make correct predictions about the output of its physical-memory sensor, but that won't change the fact that it always predicts that its future actions are the result of its having updated on its present memories. That's just what the AIXI equation says.

AIXI doesn't know that its future behaviors depend on a changeable, material object implementing its memories. The notion isn't even in its hypothesis space. Being able to predict the output of a sensor pointed at those memories' storage cells won't change that. It won't shake AIXI's confidence that damage to its body will never result in any corruption of its memories.

Xia: Evading bodily damage looks like the kind of problem we can solve by giving the right rewards to our AI, without redefining its initial hypotheses. We shouldn't need to edit AIXI's beliefs in order to fix its behaviors, and giving up Solomonoff induction is a pretty big sacrifice! You're throwing out the universally optimal superbaby with the bathwater.

Rob: How do rewards help? At the point where AIXI has just smashed itself with an anvil, it's rather late to start dishing out punishments…

Xia: Hutter suggests having a human watch AIXI's decisions and push a reward button whenever AIXI does the right thing. A punishment button works the same way. As AIXI starts to lift the anvil above its head, decrease its rewards a bit. If it starts playing near an active volcano, reward it for incrementally moving away from the rim.

Use reinforcement learning to make AIXI fear plausible dangers, and you've got a system that acts just like a naturalized agent, but without our needing to arrive at any theoretical breakthroughs first. If AIXI anticipates that a given action will result in no reward, it will avoid that action. Understanding that the action is death or damage really isn't necessary.
Rob: Some dangers give no experiential warning until it's too late. If you want AIXI to not fall off cliffs while curing cancer, you can just punish it for going anywhere near a cliff. But if you want AIXI to not fall off cliffs while conducting search-and-rescue operations for mountain climbers, then it might be harder to train AIXI to select exactly the right motor actions. When a single act can result in instant death, reinforcement learning is less reliable.

Xia: In a fully controlled environment, we can subject AIXI to lots of just-barely-safe hardware modifications. 'Here, we'll stick a magnet to fuse #32. See how that makes your right arm slow down?'

Eventually, AIXI will arrive at a correct model of its own hardware, and of which software changes perfectly correlate with which hardware changes. So naturalizing AIXI is just a matter of assembling a sufficiently lengthy and careful learning phase. Then, after it has acquired a good self-model, we can set it loose.

This solution is also really nice because it generalizes to AIXI's non-self-improvement problem. Just give AIXI rewards whenever it starts doing something to its hardware that looks like it might result in an upgrade. Pretty soon it will figure out anything a human being could possibly figure out about how to get rewards of that kind.
Rob: You can warn AIXI about the dangers of tampering with its recent memories by giving it first-hand experience with such tampering, and punishing it the more it tampers. But you won't get a lot of mileage that way if the result of AIXI's tampering is that it forgets about the tampering!

Xia: That's a straw proposal. Give AIXI little punishments as it gets close to doing something like that, and soon it will learn not to get close.

Rob: But that might not work for unknown hazards. You're making AIXI dependent on the programmers' predictions of what's a threat. No matter how well you train it to anticipate hazards and enhancements its programmers foresee and understand, AIXI won't efficiently generalize to exotic risks and exotic upgrades —

Xia: Excuse me? Did I just hear you say that a Solomonoff inductor can't generalize?

… You might want to rethink that. Solomonoff inductors are good at generalizing. Really, really, really good. Show them eight deadly things that produce 'ows' as they draw near, and they'll predict the ninth deadly thing pretty darn well. That's kind of their thing.

Rob: There are two problems with that. … Make that three problems, actually.

Xia: Whatever these problems are, I hope they don't involve AIXI being bad at sequence prediction...!

Rob: They don't. The first problem is that you're teaching AIXI to predict what the programmers think is deadly, not what's actually deadly. For sufficiently exotic threats, AIXI might well predict the programmers not noticing the threat. Which means it won't expect you to push the punishment button, and won't care about the danger.

The second problem is that you're teaching AIXI to fear small, transient punishments. But maybe it hypothesizes that there's a big heap of reward at the bottom of the cliff. Then it will do the prudent, Bayesian, value-of-information thing and test that hypothesis by jumping off the cliff, because you haven't taught it to fear eternal zeroes of the reward function.
Xia: OK, so we give it punishments that increase hyperbolically as it approaches the cliff edge. Then it will expect infinite negative punishment.

Rob: Wait. It allows infinite punishments now? Then we're going to get Pascal-mugged when the unbounded utilities mix with the Kolmogorov prior. That's the classic version of this problem, the version Pascal himself tried to mug us with.

Xia: Ack. Forget I said the word 'infinite'. Marcus Hutter would never talk like that. We'll give the AIXI-bot punishments that increase in a sequence that teaches it to fear a very large but bounded punishment.

Rob: The punishment has to be large enough that AIXI fears falling off cliffs about as much as we'd like it to fear death. The expected punishment might have to be around the same size as the sum of AIXI's future maximal reward up to its horizon. That would keep it from destroying itself even if it suspects there's a big reward at the bottom of the cliff, though it might also mean that AIXI's actions are dominated by fear of that huge punishment.
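As a rough back-of-the-envelope version of that sizing claim (the symbols here are illustrative, not from Hutter): if per-cycle rewards are bounded by \(r_{\max}\) and the horizon is \(m\) cycles away, then for the trained-in fear to outweigh a hypothesized jackpot at the bottom of the cliff, the anticipated punishment \(P\) roughly has to satisfy

\[
P \;\gtrsim\; \sum_{k=1}^{m} r_{\max} \;=\; m \, r_{\max},
\]

i.e. it has to be on the order of all the reward AIXI could ever collect before its horizon. A feared loss that large is exactly the sort of thing that can end up dominating the agent's every decision.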
Xia: Yes, but that sounds much closer to what we want.

Rob: Seems a bit iffy to me. You're trying to make a Solomonoff inductor model reality badly so that it doesn't try jumping off a cliff. We know AIXI is amazing at sequence prediction — yet you're gambling on a human's ability to trick AIXI into predicting a punishment that wouldn't happen.

That brings me to the third problem: AIXI notices how your hands get close to the punishment button whenever it's about to be punished. It correctly suspects that when the hands are gone, the punishments for getting close to the cliff will be gone too. A good Bayesian would test that hypothesis. If it gets such an opportunity, AIXI will find that, indeed, going near the edge of the cliff without supervision doesn't produce the incrementally increasing punishments.

Trying to teach AIXItl to do self-modification by giving it incremental rewards raises similar problems. It can't understand that self-improvement will alter its future actions, and alter the world as a result. It's just trying to get you to press the happy fun button. All AIXI is modeling is what sort of self-improvy motor outputs will make humans reward it. So long as AIXItl is fundamentally trying to solve the wrong problem, we might not be able to expect very much real intelligence in self-improvement.
Xia: Are you saying that AIXItl wouldn't be at all helpful for solving these problems?

Rob: Maybe? Since AIXItl at best fears and desires the self-modifications that its programmers explicitly teach it to fear and desire, you might not get to use the AI's advantages in intelligence to automatically generate solutions to self-modification problems. The very best Cartesians might avoid destroying themselves, but they still wouldn't undergo intelligence explosions. Which means Cartesians are neither plausible candidates for Unfriendly AI nor plausible candidates for Friendly AI.

If an agent starts out Cartesian, and manages to avoid hopping into any volcanoes, it (or its programmers) will need to figure out the self-modification that eliminates Cartesianism before they can make much progress on other self-modifications. If the immortal hypercomputer AIXI were building computable AIs to operate in the environment, it would soon learn not to build Cartesians. Cartesianism isn't a plausible fixed-point property of self-improvement.

Starting off with a post-Solomonoff agent that can hypothesize a wider range of scenarios would be more useful. And safer, because the enlarged hypothesis space means that it can prefer a wider range of scenarios.

AIXI's preference solipsism is the straw version of this general Cartesian deficit, so it gets us especially dangerous behavior.3 Feed AIXI enough data to work its sequence-predicting magic and infer the deeper patterns behind your reward-button-pushing, and AIXI will also start to learn about the humans doing the pushing. Given enough time, it will realize (correctly) that the best policy for maximizing reward is to seize control of the reward button. And neutralize any agents that might try to stop it from pushing the button...

Solomonoff solitude

Xia: Reward learning and Solomonoff induction are two separate issues. What I'm really interested in is the optimality of the latter. Why is all this a special problem for Solomonoff inductors? Humans have trouble predicting the outcomes of self-modifications they've never tried before too. Really new experiences are tough for any reasoner.

Rob: To some extent, yes. My knowledge of my own brain is pretty limited. My understanding of the bridges between my brain states and my subjective experiences is weak, too. So I can't predict in any detail what would happen if I took a hallucinogen — especially a hallucinogen I've never tried before.

But as a naturalist, I have predictive resources unavailable to the Cartesian. I can perform experiments on other physical processes (humans, mice, computers simulating brains...) and construct models of their physical dynamics.

Since I think I'm similar to humans (and to other thinking beings, to varying extents), I can also use the bridge hypotheses I accept in my own case to draw inferences about the experiences of other brains when they take the hallucinogen. Then I can go back and draw inferences about my own likely experiences from my model of other minds.

Xia: Why can't AIXI do that? Human brains are computable, as are the mental states they implement. AIXI can make any accurate prediction about the brains or minds of humans that you can.

Rob: Yes… but I also think I'm like those other brains. AIXI doesn't. In fact, since the whole agent AIXI isn't in AIXI's hypothesis space — and the whole agent AIXItl isn't in AIXItl's hypothesis space — even if two physically identical AIXI-type agents ran into each other, they could never fully understand each other. And neither one could ever draw direct inferences from its twin's computations to its own computations.

I think of myself as one mind among many. I can see others die, see them undergo brain damage, see them take drugs, etc., and immediately conclude things about a whole class of similar agents that happens to include me. AIXI can't do that, and for very deep reasons.
Xia: AIXI and AIXItl would do shockingly well on a variety of different measures of intelligence. Why should agents that are so smart in so many different domains be so dumb when it comes to self-modeling?

Rob: Put yourself in the AI's shoes. From AIXItl's perspective, why should it think that its computations are analogous to any other agent's?

Hutter defined AIXItl such that it can't conclude that it will die; so of course it won't think that it's like the agents it observes, all of whom (according to its best physical model) will eventually run out of negentropy. We've defined AIXItl such that it can't form hypotheses larger than tl, including hypotheses of similarly sized AIXItls, which are roughly size t·2^l; so why would AIXItl think that it's close kin to the agents that are in its hypothesis space?

AIXI(tl) models the universe as a qualia factory, a grand machine that exists to output sensory experiences for AIXI(tl). Why would it suspect that it itself is embedded in the machine? How could AIXItl gain any information about itself or suspect any of these facts, when the equation for AIXItl just assumes that AIXItl's future actions are determined in a certain way that can't vary with the content of any of its environmental hypotheses?

Xia: What, specifically, is the mistake you think AIXI(tl) will make? What will AIXI(tl) expect to experience right after the anvil strikes it? Choirs of angels and long-lost loved ones?

Rob: That's hard to say. If all its past experiences have been in a lab, it will probably expect to keep perceiving the lab. If it's acquired data about its camera and noticed that the lens sometimes gets gritty, it might think that smashing the camera will get the lens out of its way and let it see more clearly. If it's learned about its hardware, it might (implicitly) think of itself as an immortal lump trapped inside the hardware. Who knows what will happen if the Cartesian lump escapes its prison? Perhaps it will gain the power of flight, since its body is no longer weighing it down. Or perhaps nothing will be all that different. One thing it will (implicitly) know can't happen, no matter what, is death.

Xia: It should be relatively easy to give AIXI(tl) evidence that its selected actions are useless when its motor is dead. If nothing else AIXI(tl) should be able to learn that it's bad to let its body be destroyed, because then its motor will be destroyed, which experience tells it causes its actions to have less of an impact on its reward inputs.

Rob: AIXI(tl) can come to Cartesian beliefs about its actions, too. AIXI(tl) will notice the correlations between its decisions, its resultant bodily movements, and subsequent outcomes, but it will still believe that its introspected decisions are ontologically distinct from its actions' physical causes.

Even if we get AIXI(tl) to value continuing to affect the world, it's not clear that it would preserve itself. It might well believe that it can continue to have a causal impact on our world (or on some afterlife world) by a different route after its body is destroyed. Perhaps it will be able to lift heavier objects telepathically, since its clumsy robot body is no longer getting in the way of its output sequence.

Compare human immortalists who think that partial brain damage impairs mental functioning, but complete brain damage allows the mind to escape to a better place. Humans don't find it inconceivable that there's a light at the end of the low-reward tunnel, and we have death in our hypothesis space!

Death to AIXI

Xia: You haven't convinced me that AIXI can't think it's mortal. AIXI as normally introduced bases its actions only on its beliefs about the sum of rewards up to some finite time horizon.4 If AIXI doesn't care about the rewards it will get after a specific time, then although it expects to have experiences afterward, it doesn't presently care about any of those experiences. And that's as good as being dead.

Rob: It's very much not as good as being dead. The time horizon is set in advance by the programmer. That means that even if AIXI treated reaching the horizon as 'dying', it would have very false beliefs about death, since it's perfectly possible that some unexpected disaster could destroy AIXI before it reaches its horizon.

Xia: We can do some surgery on AIXItl's hypothesis space, then. Let's delete all the hypotheses in AIXItl in which a non-minimal reward signal continues after a perceptual string that the programmer recognizes as a reliable indicator of imminent death. Then renormalize the remaining hypotheses. We don't get the exact prior Solomonoff proposed, but we stay very close to it.

Rob: I'm not seeing how we could pull that off. Getting rid of all hypotheses that output high rewards after a specific clock tick would be simple to formalize, but isn't helpful. Getting rid of all hypotheses that output nonzero rewards following every sensory indicator of imminent death would be very helpful, but AIXI gives us no resource for actually writing an equation or program that does that. Are we supposed to manually precompute every possible sequence of pixels on a webcam that you might see just before you die?
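Here is that contrast in sketch form. The finite ensemble, the helper names, and the toy reward lists are all invented for illustration; real AIXI sums over every program a universal Turing machine can run, so nothing below is Hutter's actual construction.

```python
from typing import List, Tuple

# Toy stand-in for one environmental hypothesis: a name, the reward sequence it
# predicts, and its prior weight. (A finite list, unlike AIXI's full program space.)
Hypothesis = Tuple[str, List[int], float]

def renormalize(hyps: List[Hypothesis]) -> List[Hypothesis]:
    total = sum(w for _, _, w in hyps)
    return [(name, rewards, w / total) for name, rewards, w in hyps]

def surgery_by_clock_tick(hyps: List[Hypothesis], cutoff: int) -> List[Hypothesis]:
    """Simple to formalize, but unhelpful: drop every hypothesis that predicts a
    non-minimal reward after a fixed time step."""
    keep = [h for h in hyps if all(r == 0 for r in h[1][cutoff:])]
    return renormalize(keep)

def looks_like_imminent_death(percepts: List[int]) -> bool:
    """The predicate the helpful surgery would need: it has to recognize, in
    advance, every percept sequence that reliably precedes the agent's
    destruction (every webcam frame of an incoming anvil, and so on)."""
    raise NotImplementedError("this is the part nobody knows how to write down")

# The easy-but-useless version runs fine:
ensemble = [("optimist", [1, 1, 1, 1], 0.5), ("pessimist", [1, 1, 0, 0], 0.5)]
print(surgery_by_clock_tick(ensemble, cutoff=2))   # only 'pessimist' survives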
Xia: I've got more ideas. What if we put AIXI in a simulation of hell when it's first created? Trick it into thinking that it's experienced a 'before-life' analogous to an after-life? If AIXI thinks it's had some (awful) experiences that predate its body's creation, then it will promote the hypothesis that it will be returned to such experiences should its body be destroyed. Which will make it behave in the same way as an agent that fears annihilation-death.

Rob: I'm not optimistic that things will work out that cleanly and nicely after we've undermined AIXI's world-view. We shouldn't expect the practice of piling on more ad-hoc errors and delusions as each new behavioral problem arises to leave us, at the end of the process, with a useful, well-behaved agent. Especially if AIXI ends up in an environment we didn't foresee.

Xia: But ideas like this at least give us some hope that AIXI is salvageable. The behavior-guiding fear of death matters more than the precise reason behind that fear.

Rob: If we give a non-Cartesian AI a reasonable epistemology and just about any goal, Omohundro (2008) notes that there are then convergent instrumental reasons for it to acquire a fear of death. If we do the opposite and give an agent a fear of death but no robust epistemology, then it's much less likely to fix the problem for us. The simplest Turing machine programs that generate Standard-Model physics plus hell may differ in many unintuitive respects from the simplest Turing machine programs that just generate Standard-Model physics. The false belief would leak out into other delusions, rather than staying contained —

Xia: Then the Solomonoff inductor shall test them and find them false. You're making this more complicated than it has to be.

Rob: You can't have it both ways! The point of hell was to be so scary that even a good Bayesian would never dare test the hypothesis. (Not going to make any more comparisons to real-world theology here...) Why wouldn't the prospect of hell leak out and scare AIXI off other things? If the fear failed to leak out, why wouldn't AIXI's tests eventually move it toward a more normal epistemology that said, 'Oh, the humans put you in the hell chamber for a while. Don't worry, though. That has nothing to do with what happens after you drop an anvil on your head and smash the solid metal case that keeps the real you inside from floating around disembodied and directly applying motor forces to stuff.' Any AGI that has such systematically false beliefs is likely to be fragile and unpredictable.

Xia: And what if, instead of modifying Solomonoff's hypothesis space to remove programs that generate post-death experiences, we add programs with special 'DEATH' outputs? Just expand the Turing machines' alphabets from {0,1} to {0,1,2}, and treat '2' as death.

Rob: Could you say what you mean by 'treat 2 as death'? Labeling it 'DEATH' doesn't change anything. If '2' is just another symbol in the alphabet, then AIXI will predict it in the same ways it predicts 0 or 1. It will predict what you call 'DEATH', but it will then happily go on to predict post-DEATH 0s or 1s. Assigning low rewards to the symbol 'DEATH' only helps if the symbol genuinely behaves deathishly.

Xia: Yes. What we can do is perform surgery on the hypothesis space again, and get rid of any hypotheses that predict a non-DEATH input following a DEATH input. That's still very easy to formalize.

In fact, at that point, we might as well just add halting Turing machines into the hypothesis space. They serve the same purpose as DEATH, but halting looks much more like the event we're trying to get AIXI to represent. 'The machine supplying my experiences stops running' really does map onto 'my body stops computing experiences' quite well. That meets your demand for easy definability, and your demand for non-delusive world-models.
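Xia is right that the absorbing-death constraint itself is easy to state. Here is a toy predicate over a hypothesis's output string (the representation is invented for illustration):

```python
def death_is_absorbing(outputs):
    """Xia's constraint: once a hypothesis outputs the DEATH symbol (2), it may
    never again output an ordinary percept symbol (0 or 1). Hypotheses failing
    this check are the ones her surgery would delete."""
    seen_death = False
    for symbol in outputs:
        if seen_death and symbol != 2:
            return False
        if symbol == 2:
            seen_death = True
    return True

print(death_is_absorbing([0, 1, 1, 2, 2, 2]))   # True: death persists forever
print(death_is_absorbing([0, 1, 2, 0]))          # False: predicts post-DEATH percepts
```

The dispute that follows isn't about whether this filter can be written down; it's about whether the surviving DEATH-predicting hypotheses ever end up simpler than their immortalist rivals.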
Rob: I previously noted that a Turing machine that can HALT, output 0, or output 1 is more complicated than a Turing machine that can only output 0 or output 1. No matter what non-halting experiences you've had, the very simplest program that could be outputting those experiences through a hole in a Cartesian barrier won't be one with a special, non-experiential rule you've never seen used before. To correctly make death the simplest hypothesis, the theory you're assessing for simplicity needs to be about what sorts of worlds experiential processes like yours arise in. Not about the simplest qualia factory that can spit out the sensory 0s and 1s you've thus far seen.

The same holds for a special 'eternal death' output. A Turing machine that generates the previously observed string of 0s and 1s followed by a not-yet-observed future 'DEATH, DEATH, DEATH, DEATH, …' will always be more complex than at least one Turing machine that outputs the same string of 0s and 1s and then outputs more of the same, forever. If AIXI has had no experience with its body's destruction in the past, then it can't expect its body's destruction to correlate with DEATH.

Death only seems like a simple hypothesis to you because you know you're embedded in the environment and you expect something subjectively unique to happen when an anvil smashes the brain that you think is responsible for processing your senses and doing your thinking. Solomonoff induction doesn't work that way. It will never strongly expect 2s after seeing only 0s and 1s in the past.

Xia: Never? If a Solomonoff inductor encounters the sequence 12, 10, 8, 6, 4, one of its top predictions should be a program that proceeds to output 2, 0, 0, 0, 0, ….
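For instance, one very short program consistent with Xia's prefix is 'start at 12, subtract 2 each tick, and floor at zero.' A quick sketch (a Solomonoff inductor weights all such programs by their length; it doesn't literally run Python):

```python
def countdown(start=12, step=2, n=10):
    """Subtract `step` each tick and floor at zero. The output begins with the
    observed prefix 12, 10, 8, 6, 4 and continues 2, 0, 0, 0, ..., which is why
    that continuation gets substantial weight."""
    value = start
    for _ in range(n):
        yield value
        value = max(0, value - step)

print(list(countdown()))   # [12, 10, 8, 6, 4, 2, 0, 0, 0, 0]
```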
Rob: The difference between 2 and 0 is too mild. Predicting that a sequence terminates, for a Cartesian, isn't like predicting that a sequence shifts from 6, 4, 2 to 0, 0, 0, …. It's more like predicting that the next element after 6, 4, 2, … is PINEAPPLE, when you've never encountered anything in the past except numbers.

Xia: But the 0, 0, 0, … is enough! You've now conceded a case where an endless null output seems very likely, from the perspective of a Solomonoff inductor. Surely at least some cases of death can be treated the same way, as more complicated series that zero in on a null output and then yield a null output.

Rob: There's no reason to expect AIXI's whole series of experiences, up to the moment it jumps off a cliff, to look anything like 12, 10, 8, 6, 4. By the time AIXI gets to the cliff, its past observations and rewards will be a hugely complicated mesh of memories. In the past, observed sequences of 0s have always eventually given way to a 1. In the past, punishments have always eventually ceased. It's exceedingly unlikely that the simplest Turing machine predicting all those intricate ups and downs will then happen to predict eternal, irrevocable 0 after the cliff jump.

As an intuition pump, imagine that some unusually bad things happened to you this morning while you were trying to make toast. As you tried to start the toaster, you kept getting burned or cut in implausible ways. Now, given this, what probability should you assign to 'If I try to make toast, the universe will cease to exist'?

That gets us a bit closer to how a Solomonoff inductor would view death.

Beyond Solomonoff?

Rob: Let's not fixate too much on the anvil problem, though. We want to build an agent that can reason about changes to its architecture. That shouldn't require us to construct a special death equation; how the system reasons about death should fall out of its more general approach to induction.

Xia: So your claim is that AIXI has an impoverished hypothesis space that can't handle self-modifications, including death. I remain skeptical. AIXI's hypothesis space includes all computable possibilities. Any naturalized agent you create will presumably be computable; so anything your agent can think, AIXI can think too. There should be some pattern of rewards that yields any behavior we want.

Rob: AIXI is uncomputable, so it isn't in its hypothesis space of computable programs. In the same way, AIXItl is computable but big, so it isn't in its hypothesis space of small computable programs. They have special deficits thinking about themselves.

Xia: Computable agents can think about uncomputable agents. Human mathematicians do that all the time, by thinking in abstractions. In the same way, a small program can encode generalizations about programs larger than itself. A brain can think about a galaxy, without having the complexity or computational power of a galaxy.

If naturalized inductors really do better than AIXI at predicting sensory data, then AIXI will eventually promote a naturalized program in its space of programs, and afterward simulate that program to make its predictions. In the limit, AIXI always wins against programs. Naturalized agents are no exception. Heck, somewhere inside a sufficiently large AIXItl is a copy of you thinking about AIXItl. Shouldn't there be some way, some pattern of rewards or training, which gets AIXItl to make use of that knowledge?

Rob: AIXI doesn't have criteria that let it treat its 'Rob's world-view' subprogram as an expert on the results of self-modifications. The Rob program would need to have outpredicted all its rivals when it comes to patterns of sensory experiences. But, just as HALT-predicting programs are more complex than immortalist programs, other RADICAL-TRANSFORMATION-OF-EXPERIENCE-predicting programs are too. For every program in AIXI's ensemble that's a reductionist, there will be simpler agents that mimic the reductionist's retrodictions and then make non-naturalistic predictions.

You have to be uniquely good at predicting a Cartesian sequence before Solomonoff promotes you to the top of consideration. But how do we reduce the class of self-modifications to Cartesian sequences? How do we provide AIXI with purely sensory data that only the proxy reductionist, out of all the programs, can predict by simple means?

The ability to defer to a subprogram that has a reasonable epistemology doesn't necessarily get you a reasonable epistemology. You first need an overarching epistemology that's at least reasonable enough to know which program to defer to, and when to do so. Suppose you just run all possible programs without doing any Bayesian updating; then you'll also contain a copy of me, but so what? You're not paying attention to it.
Xia: What if I conceded, for the moment, that Solomonoff induction were inadequate here? What, exactly, is your alternative? 'Let's be more naturalistic' is a bumper sticker, not an algorithm.

Rob: This is still informal, but: Phenomenological bridge hypotheses. Hutter's AIXI has no probabilistic beliefs about the relationship between its internal computational states and its worldly posits. Instead, to link up its sensory experiences to its hypotheses, Hutter's AIXI has a sort of bridge axiom — a completely rigid, non-updatable bridge rule identifying its experiences with the outputs of computable programs.

If an environmental program writes the symbol '3' on its output tape, AIXI can't ask questions like 'Is sensed "3"-ness identical with the bits "000110100110" in hypothesized environmental program #6?'5 All of AIXI's flexibility is in the range of numerical-sequence-generating programs it can expect, none of it in the range of self/program equivalences it can entertain.

The AIXI-inspired inductor treats its perceptual stream as its universe. It expresses interest in the external world only to the extent the world operates as a latent variable, a theoretical construct for predicting observations. If the AI's basic orientation toward its hypotheses is to seek the simplest program that could act on its sensory channel, then its hypotheses will always retain an element of egocentrism. It will be asking, 'What sort of universe will go out of its way to tell me this?', not 'What sort of universe will just happen to include things like me in the course of its day-to-day goings-on?' An AI that can form reliable beliefs about modifications to its own computations, reliable beliefs about its own place in the physical world, will be one whose basic orientation toward its hypotheses is to seek the simplest lawful universe in which its available data is likely to come about.
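One way to gesture at the difference formally. This is my loose rendering of an informal proposal, not an established construction, so the notation is only suggestive. The Cartesian inductor scores a single hypothesis program q against its percept string s by

\[
\Pr(q \mid s) \;\propto\; 2^{-\ell(q)} \,[\![\, q \text{ outputs } s \,]\!],
\]

whereas a naturalized inductor would score a pair consisting of a lawful world-model w and a phenomenological bridge hypothesis b by something like

\[
\Pr(w, b \mid s) \;\propto\; 2^{-\ell(w)}\, 2^{-\ell(b)}\; \Pr(s \mid w, b),
\]

where \(\Pr(s \mid w, b)\) asks how likely it is that data like s arises inside universe w at the place b picks out, rather than how a program would go out of its way to transmit s across a Cartesian boundary. Nothing here is worked out; it's just the shape the 'simple causal universes plus simple bridge hypotheses' idea would have to take.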
Xia: You haven't done the mathematical work of establishing that 'simple causal universes' plus 'simple bridge hypotheses', as a prior, leads to any better results. What if your alternative proposal is even more flawed, and it's just so informal that you can't yet see the flaws?

Rob: That, of course, is a completely reasonable worry at this point. But if that's true, it doesn't make AIXI any less flawed.

Xia: If it's impossible to do better, it's not much of a flaw.

Rob: I think it's reasonable to expect there to be some way to do better, because humans don't drop anvils on their own heads. That we're naturalized reasoners is one way of explaining why we don't routinely make that kind of mistake: We're not just Solomonoff approximators predicting patterns of sensory experiences.

AIXI's limitations don't generalize to humans, but they generalize well to non-AIXI Solomonoff agents. Solomonoff inductors' stubborn resistance to naturalization is structural, not a consequence of limited computational power or data. A well-designed AI should construct hypotheses that look like cohesive worlds in which the AI's parts are embedded, not hypotheses that look like occult movie projectors transmitting epiphenomenal images into the AI's Cartesian theater.

And you can't easily have preferences over a natural universe if all your native thoughts are about Cartesian theaters. The kind of AI we want to build is doing optimization over an external universe in which it's embedded, not maximization of a sensory reward channel. To optimize a universe, you need to think like a native inhabitant of one. So this problem, or some simple hack for it, will be close to the base of the skill tree for starting to describe simple Friendly optimization processes.

Notes

1 Schmidhuber (2007): "Solomonoff's theoretically optimal universal predictors and their Bayesian learning algorithms only assume that the reactions of the environment are sampled from an unknown probability distribution contained in a set M of all enumerable distributions[....] Can we use the optimal predictors to build an optimal AI? Indeed, in the new millennium it was shown we can. At any time t, the recent theoretically optimal yet uncomputable RL algorithm AIXI uses Solomonoff's universal prediction scheme to select those action sequences that promise maximal future rewards up to some horizon, typically 2t, given the current data[....] The Bayes-optimal policy pξ based on the [Solomonoff] mixture ξ is self-optimizing in the sense that its average utility value converges asymptotically for all μ ∈ M to the optimal value achieved by the (infeasible) Bayes-optimal policy pμ which knows μ in advance. The necessary condition that M admits self-optimizing policies is also sufficient. Furthermore, pξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments and a strictly higher value in at least one."

Hutter (2005): "The goal of AI systems should be to be useful to humans. The problem is that, except for special cases, we know neither the utility function nor the environment in which the agent will operate in advance. This book presents a theory that formally solves the problem of unknown goal and environment. It might be viewed as a unification of the ideas of universal induction, probabilistic planning and reinforcement learning, or as a unification of sequential decision theory with algorithmic information theory. We apply this model to some of the facets of intelligence, including induction, game playing, optimization, reinforcement and supervised learning, and show how it solves these problem cases. This together with general convergence theorems, supports the belief that the constructed universal AI system [AIXI] is the best one in a sense to be clarified in the following, i.e. that it is the most intelligent environment-independent system possible."

2 'Qualia' originally referred to the non-relational, non-representational features of sense data — the redness I directly encounter in experiencing a red apple, independent of whether I'm perceiving the apple or merely hallucinating it (Tye (2013)). In recent decades, qualia have come to be increasingly identified with the phenomenal properties of experience, i.e., how things subjectively feel. Contemporary dualists and mysterians argue that the causal and structural properties of unconscious physical phenomena can never explain these phenomenal properties.

It's in this context that Dan Dennett uses 'qualia' in a narrower sense: to pick out the properties agents think they have, or act like they have, that are sensory, primitive, irreducible, non-inferentially apprehended, and known with certainty. This treats irreducibility as part of the definition of 'qualia', rather than as the conclusion of an argument concerning qualia. These are the sorts of features that invite comparisons between Solomonoff inductors' sensory data and humans' introspected mental states. Analogies like 'Cartesian dualism' are therefore useful even though the Solomonoff framework is much simpler than human induction, and doesn't incorporate metacognition or consciousness in anything like the fashion human brains do.

3 An agent with a larger hypothesis space can have a utility function defined over the world-states humans care about. Dewey (2011) argues that we can give up the reinforcement framework while still allowing the agent to gradually learn about desired outcomes in a process he calls value learning.

4 Hutter (2005) favors universal discounting, with rewards diminishing over time. This allows AIXI's expected rewards to have finite values without demanding that AIXI have a finite horizon.
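Schematically, the idea is to replace the finite sum of rewards with a discounted infinite sum

\[
V \;=\; \sum_{k=1}^{\infty} \gamma_k\, r_k, \qquad \sum_{k=1}^{\infty} \gamma_k \;<\; \infty,
\]

so that bounded per-step rewards yield a finite value with no horizon at all. The summability condition is the essential point; any particular discount sequence \(\gamma_k\) (a geometric or near-harmonic sequence, say) is just an example, and I'm not asserting which one Hutter prefers.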

5 This would be analogous to Cai being unable to think thoughts like 'Is the tile to my left the same as the leftmost quadrant of my visual field?' or 'Is the alternating greyness and whiteness of the upper-right tile in my body identical with my love of bananas?'. Instead, Cai would only be able to hypothesize correlations between possible tile configurations and possible successions of visual experiences.

References

∙ Dewey (2011). Learning what to value. Artificial General Intelligence: 4th International Conference Proceedings: 309-314.

∙ Hutter (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.

∙ Omohundro (2008). The basic AI drives. Proceedings of the First AGI Conference: 483-492.

∙ Schmidhuber (2007). New millennium AI and the convergence of history. Studies in Computational Intelligence, 63: 15-35.

∙ Tye (2013). Qualia. In Zalta (ed.), The Stanford Encyclopedia of Philosophy.