The Self-Unaware AI Oracle

(See also my later post, Self-Supervised Learning and AGI Safety)

Abstract: A self-unaware AI oracle is a system that builds a world-model inside which there is no concept of “me”. I propose that such systems are safe to operate, that they can plausibly be built by continuing known AI development paths, and that they would be competitive with any other type of AI: capable of things like inventing new technology and safe recursive self-improvement.

Epistemic status: Treat as preliminary brainstorming; please let me know any ideas, problems, and relevant prior literature.


My proposal is to build a system that takes in data about the world (say, all the data on the internet) and builds a probabilistic generative model of that data, in the sense of taking an arbitrary number of bits and then predicting one or more masked bits. If you want to imagine something more specific, here are three examples: (1) A hypercomputer using Solomonoff Induction; (2) Some future descendant of GPT-2, using a massive Transformer (or some better future architecture); (3) A neocortex-like algorithm that builds low-level predictive models of small segments of data, then a hierarchy of higher-level models that predict which of the lower-level models will occur in which order in different contexts.[1]
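
To make the masked-bit framing concrete, here is a minimal toy sketch (my own illustration, vastly simpler than any of the three examples above; the function names and the tiny dataset are assumptions for illustration only): a count-based model that estimates a masked bit from the few bits immediately preceding it.

```python
from collections import Counter

def train_bit_model(bits, k=2):
    """Count how often each length-k context is followed by '0' or '1'."""
    counts = Counter()
    for i in range(len(bits) - k):
        counts[(bits[i:i + k], bits[i + k])] += 1
    return counts

def predict_masked_bit(counts, context):
    """Return P(masked bit = '1' | preceding context), with add-one smoothing."""
    ones = counts[(context, "1")] + 1
    zeros = counts[(context, "0")] + 1
    return ones / (ones + zeros)

data = "0101010101010101"          # toy stand-in for "all the data on the internet"
model = train_bit_model(data, k=2)
p = predict_masked_bit(model, "01")  # in this data, "01" is always followed by "0"
```

A real system would replace the counting with a massive learned model, but the interface is the same: context bits in, a probability distribution over the masked bit out.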

Further, we design the system to be “self-unaware”: we want it to construct a generative world-model in which its own physical instantiation does not have any special status. The above (1-2) are examples, as is (3) if you don’t take “neocortex-like” too literally (the actual neocortex can learn about itself by, say, controlling muscles and observing the consequences).

Finally, we query the system in a way that is compatible with its self-unawareness. For example, if we want to cure cancer, one nice approach would be to program it to search through its generative model and output the least improbable scenario wherein a cure for cancer is discovered somewhere in the world in the next 10 years. Maybe it would output: “A scientist at a university will be testing immune therapy X, and they will combine it with blood therapy Y, and they’ll find that the two together cure all cancers”. Then, we go combine therapies X and Y ourselves.

What is self-unawareness?

In the context of predictive world-modeling systems, a self-unaware system (a term I just made up; is there an existing term?) is one that does not have itself (or any part of itself, or any consequences of itself) as specially-flagged entities in its world model.

Example of a self-aware system: A traditional RL agent. (Why? Because it has a special concept of “its own actions” represented in its models.)

Example of a self-unaware system: Any system that takes inputs, does a deterministic computation (with no further inputs from the environment), and spits out an output. (Why? Because when you correctly compute a computable function, you get the same answer regardless of where and whether the computation is physically instantiated in the universe.) (Edit to add: On second thought, this is wrong, according to the definition of self-unawareness that I’m using everywhere else. The “more examples” subsection is a better description of what I’m getting at.)

In one sense, ensuring that an AGI is self-unaware seems like it should be pretty easy; “The space of all computable functions” is a pretty big space to explore, and that doesn’t even exhaust the possibilities! On the other hand, of course there are always pitfalls … for example, if your code has a race condition, that’s a side-channel potentially leaking information from the physical instantiation into the (symbolic) world model. Still, designing a system to be self-unaware seems pretty tractable, and maybe even amenable to formal verification (flag some variables as “part of the world-model” and other variables as “self-knowledge” and prevent them from intermingling, or something like that...).
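
As a toy illustration of the variable-flagging idea (entirely hypothetical; a real check would presumably be static information-flow verification rather than runtime checks), one could wrap every value derived from the system’s physical instantiation in a type that refuses to mix with ordinary world-model values:

```python
class SelfKnowledge:
    """Wrapper for any value derived from the system's physical instantiation.

    Mixing with ordinary (world-model) values raises immediately, so tainted
    data cannot silently leak into the world-model: a crude dynamic version
    of the variable-flagging check suggested in the text.
    """
    def __init__(self, value):
        self.value = value

    def _refuse(self, other):
        raise RuntimeError("self-knowledge must not enter the world-model")

    # Refuse arithmetic with unwrapped values from either side.
    __add__ = __radd__ = __mul__ = __rmul__ = _refuse

world_model_estimate = 0.7                     # ordinary world-model variable
own_clock_reading = SelfKnowledge(1630000000)  # value read from the machine itself

try:
    leaked = world_model_estimate + own_clock_reading  # flagged immediately
except RuntimeError:
    leaked = None  # the intermingling was caught
```

This only catches direct mixing, not side-channels like the race condition mentioned above, which is exactly why something closer to formal verification would be needed.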

More examples of how to think about self-unawareness

  • If a self-unaware system is, at some moment, consolidating its knowledge of anthropology, it doesn’t “know” that it is currently consolidating its knowledge of anthropology; this fact is not represented in the world-model it’s building.

  • If a self-unaware system is running on a particular supercomputer in Mexico, maybe its world-model “knows” (from news stories) that there is a new AI research project using this particular supercomputer in Mexico, but it won’t conclude that this research project is “me”, because, as far as it knows, there is no “me”; it is utterly ignorant of its own existence.

If you find this unintuitive, well, so do I! That’s because self-unaware systems are super non-anthropomorphic. If I’m able to think straight about this concept, it’s only by firmly grounding myself in the three examples I mentioned in the Overview. For example, take a hypercomputer, using Solomonoff Induction to find the world-model that most parsimoniously predicts all the data on the internet. Does this world-model contain the statement: “I am a hypercomputer running Solomonoff Induction”? No!! That’s just not something that would happen in this system. (Correction[2])

Just as Hume said you can’t derive an “ought” from an “is”, my contention here is that you can’t derive a first-person perspective from any amount of third-person information.

How do you query a self-unaware system?

There’s some awkwardness in querying a self-unaware system, because it can’t just directly apply its intelligence to understanding your questions, nor to making itself understood by you. Remember, it doesn’t think of itself as having input or output channels, because it doesn’t think of itself period! Still, if we spend some time writing (non-intelligent) interface code, I think querying the system should ultimately work pretty well. The system does, after all, have excellent natural-language understanding inside of it.

I think the best bet is to program the system to make conditional predictions about the world, using its world-model. I gave an example above: “Calculate the least improbable scenario, according to your world-model, wherein a cure for cancer is discovered anywhere in the world”. The subroutine does some calculation, writes the answer to disk, and terminates (of course, as always, it doesn’t know that it wrote the answer to disk, it just does it). As we read the answer, we incidentally learn the cure for cancer. I expect that we would have some visibility into its internal world-model, but even a black-box predictive world-modeler is probably OK. Imagine prompting GPT-2 for the most likely completion of the sentence “In 2030 scientists finally discovered a cure for cancer, and it was …”, or something like that.
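
As a sketch of what such interface code might look like, here the oracle’s generative model is replaced by a toy mapping from scenario descriptions to probabilities (all names and numbers are made up for illustration). Note that the query subroutine just filters and ranks; nothing in it refers to the machine doing the computing:

```python
def least_improbable(world_model, condition):
    """Return the highest-probability scenario satisfying `condition`.

    `world_model` stands in for the oracle's generative model: here, just a
    mapping from scenario descriptions to probabilities. The subroutine reads
    the model and reports an answer, with no reference to "itself".
    """
    candidates = {s: p for s, p in world_model.items() if condition(s)}
    if not candidates:
        return None, 0.0
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

# Hypothetical scenarios with made-up probabilities
toy_model = {
    "therapy X alone shows partial response": 0.30,
    "therapies X and Y together cure all cancers": 0.02,
    "no progress on cancer this decade": 0.60,
}
scenario, prob = least_improbable(toy_model, lambda s: "cure" in s)
```

In the real system, the dictionary would be replaced by search through a learned generative model, but the shape of the query (a condition in, a ranked scenario out) would be the same.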

Can you give it multiple queries in sequence without resetting it?

Yes, although things get a bit tricky when the system starts reasoning about itself. (It reasons about itself in exactly the same way, and for exactly the same reason, as it reasons about anything else in the world.)

Suppose that we continue allowing our self-unaware system to have read-only internet access, after we’ve published the cure for cancer from the previous section. Now plastered all over the newspapers are stories about the famous Self-Unaware AI Oracle running on a supercomputer in Mexico, which has just invented the cure for cancer. The system now will definitely put self-unaware AI oracles into its predictive generative world-model (if it hadn’t already), which entails trying to understand and predict what such a thing would do in different circumstances. Maybe it even reads its own source code!

Unfortunately, it won’t be able to reason about itself perfectly; that would require simulating itself, which causes an infinite regress. But the system will apply all its models and heuristics, and do the best it can to come up with a predictive model of itself, on the basis of the (non-privileged) information it has. (Note that there is no reason to expect it to build a self-consistent model, i.e., a model in which its own predictions about itself are correct.)

OK, now we go back and ask the system for something else, say, a cure for Alzheimer’s. Again, we say “Please output the least improbable scenario, according to your world-model, wherein Alzheimer’s is cured in the next 10 years”. It might say “Those scientists in Mexico, using their Self-Unaware AI Oracle, learn the following treatment approach...”.

Now, recall that with the classic AI Oracle, we worry about manipulative answers. In this particular case, we can be sure that the system itself is not inherently manipulative (see below), but we would still be in trouble if the Self-Unaware AI Oracle in Mexico believes that the Self-Unaware AI Oracle in Mexico would be manipulative. This failure mode seems extremely unlikely. As I mentioned a couple paragraphs ago, its third-person model of itself will be an imperfect approximation to the real thing, but the belief that it would be manipulative would be a bizarrely specific modeling error that came out of nowhere!

Still, out of an abundance of caution, the query should probably be something like: “Please output the least improbable scenario, according to your world-model, wherein a cure for Alzheimer’s is invented by a group with no access to any AI oracles!” Then we avoid the self-modeling issues altogether. Indeed, it should be possible to use a predictive world model to do counterfactual reasoning about what might happen in a world in which there are no AIs whatsoever.

Self-unaware AIs are safe to use

As far as I can tell, a self-unaware system should be about as safe as we could possibly hope for. Manipulation, deception, self-preservation, reproduction, and all the other Omohundro goals seem to be completely out of the question for a system with no knowledge that it is a calculation physically instantiated in the universe!

I’m trying hard to think of failure modes here. OK, there’s a potential for the system to describe possible futures while leaving out critical details. So, let’s say we ask for the least improbable scenario wherein “cancer is cured”, and the system describes the events leading up to a plague that wipes out humanity. OK, well then we ask for the second least improbable scenario, and this one sounds pretty good! But after we inject the chemical into the first test patient, it turns out that this one is also a plague that wipes out humanity, but the AI’s model output was unfortunately treating that fact as an irrelevant detail and neglecting to mention it. Are we back to Goodhart’s law and the “nearest unblocked strategy” problem? Sort of, but I don’t think it’s a particularly worrisome version. Again, we’re asking the system to guess what normal humans might plausibly do in the natural course of events (in the absence of AI), and we can look at the probabilities assigned to the scenarios to make sure they’re not outlandishly small. We can also program the system to answer our follow-up questions, like “according to your models, if someone rolls out this therapy, what is the likely impact on lifespan? what is the likely impact on the environment? how does it work on the cellular level?” and so on. And we can trust that, while the answers may be imperfect, they will not be manipulative. I’m really not seeing any cause for concern here, or elsewhere, although I’m going to keep thinking about it.
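
The probability sanity check suggested above can be sketched in a few lines (a hypothetical illustration; the scenarios, numbers, and threshold are all made up):

```python
def sane_scenarios(ranked, floor=1e-4):
    """Drop scenarios the model itself considers outlandishly improbable.

    `ranked` is a list of (scenario, probability) pairs from the oracle, a
    toy stand-in for its real output. The probability floor implements the
    sanity check in the text: if even the "least improbable" answer is
    astronomically unlikely, we should not act on it.
    """
    return [(s, p) for s, p in ranked if p >= floor]

answers = [
    ("gene therapy Z doubles remission rates", 0.01),
    ("a plague wipes out humanity, 'curing' cancer", 2e-9),
]
kept = sane_scenarios(answers)
```

This doesn’t solve the omitted-detail problem by itself; it only filters out answers that the model itself flags as wildly unlikely, with the follow-up questions doing the rest of the work.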

Are Self-Unaware AI Oracles competitive with other approaches to AGI?

I see two main disadvantages of Self-Unaware AI Oracles, but I think that both are less problematic than they first appear.

The first disadvantage is that these things are completely incompatible with RL techniques (as far as I can tell), and a lot of people seem to think that RL is the path to superintelligence. Well, I’m not at all convinced that we need RL, or that RL would ultimately even be all that helpful. The alternative path I’m proposing here is self-supervised learning: given a sequence of bits from the internet, predict the subsequent bits. So there’s a massive amount of training data; for example, I heard that 100,000 years of video have been uploaded to YouTube! I keep going back to those three examples from the beginning: GPT-2 shows that we can get impressively far on this type of self-supervised learning even with today’s technology; Solomonoff induction on the entire internet is the astronomically high ceiling on what’s possible; and the human brain, which works primarily on exactly this type of self-supervised learning[3], is a nice reference point for how far we might get along this path just by brute-force biomimetic engineering.

The second disadvantage is that it’s still an oracle, needing a human in the loop.[4] But as far as oracles go, it’s about as powerful as you could hope for: able to answer more-or-less arbitrary questions, and able to design new technology, as in the cancer example above. In particular, we can adopt a bootstrapping strategy, asking the safe self-unaware oracle to help us design a safe AGI agent.

By the same token, despite appearances, Self-Unaware AI Oracles are capable of recursive self-improvement: we just present the query in the third person. (“This is a Self-Unaware AI Oracle”, we say to it, holding up a giant mirror. “How might scientists make this type of system better?”) We can even record the system doing a calculation, then pass that video back to itself as an input to improve its self-models. I think this would be a quite safe type of self-improvement, insofar as self-unawareness is (I hope) possible to rigorously verify, and also insofar as we’re not worried about manipulative suggestions.


Again, this is intuition-based brainstorming, not rigorous argument, and I’m looking forward to any feedback. For one thing, I think there are probably better and more precise ways to define self-unawareness, but I hope my definition above is close enough to get the idea across. I’ll keep thinking about it, and I hope others do too!

  1. See Jeff Hawkins’s On Intelligence or Andy Clark’s Surfing Uncertainty, for example. ↩︎

  2. Correction: I got this example wrong. The hypercomputer chooses a predictive algorithm, and the question is whether the latter is self-unaware. That’s not so obvious… ↩︎

  3. See references in footnote 1. Of course, the human brain uses both self-supervised learning (predict the next thing you’ll see, hear, and feel) and RL (cake is good, death is bad). My feeling is that we can throw out the RL part (or throw out enough of it to allow self-unawareness) and the system will still work pretty well. For example, when Einstein invented relativity, he wasn’t doing RL interaction with the real world, but rather searching through generative models, tweaking and recombining the higher-level models and keeping them when they offered a parsimonious and accurate prediction of lower-level models. I think we can write self-unaware code that does that kind of thing. Without a reward signal, we might need to program our own mechanism to direct “attention”, i.e. to guide which aspects of the world need to be modeled with extra-high accuracy. But again, this seems like something we can just put in manually. Note: I’m not terribly confident about anything in this footnote, and want to think about it more. ↩︎

  4. If you try to make a self-unaware AI agent, it immediately starts modeling itself and adjusting its behavior to that model, which (as mentioned above) yields hard-to-predict and possibly-problematic behavior … unless there’s a trick I haven’t thought of. ↩︎