# Dualism

Cur­rently, ma­chine learn­ing al­gorithms are es­sen­tially “Carte­sian du­al­ists” when it comes to them­selves and their en­vi­ron­ment. (Not a philos­o­phy ma­jor—let me know if I’m us­ing that term in­cor­rectly. But what I mean to say is...) If I give a ma­chine learn­ing al­gorithm some data about it­self as train­ing data, there’s no self-aware­ness there—it just chugs along look­ing for pat­terns like it would for any other prob­lem. I think it’s a rea­son­able guess that our al­gorithms will con­tinue to have this “no self-aware­ness” prop­erty as they be­come more and more ad­vanced. At the very least, this “busi­ness as usual” sce­nario seems worth an­a­lyz­ing in depth.

If du­al­ism holds for Abram’s pre­dic­tion AI, the “Pre­dict-O-Matic”, its world model may hap­pen to in­clude this thing called the Pre­dict-O-Matic which seems to make ac­cu­rate pre­dic­tions—but it’s not spe­cial in any way and isn’t be­ing mod­eled any differ­ently than any­thing else in the world. Again, I think this is a pretty rea­son­able guess for the Pre­dict-O-Matic’s de­fault be­hav­ior. I sus­pect other be­hav­ior would re­quire spe­cial code which at­tempts to pin­point the Pre­dict-O-Matic in its own world model and give it spe­cial treat­ment (an “ego”).

Let’s sup­pose the Pre­dict-O-Matic fol­lows a “re­cur­sive un­cer­tainty de­com­po­si­tion” strat­egy for mak­ing pre­dic­tions about the world. It mod­els the en­tire world in rather fuzzy re­s­olu­tion, mostly to know what’s im­por­tant for any given pre­dic­tion. If some as­pect of the world ap­pears es­pe­cially rele­vant to a pre­dic­tion it’s try­ing to make, it “zooms in” and tries to model that thing in higher re­s­olu­tion. And if some part of that thing seems es­pe­cially rele­vant, it zooms in fur­ther on that part. Etc.
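To make this a bit more concrete, here's a minimal, purely hypothetical sketch of what I mean. The nested-dict world model and the `relevance` scoring function are stand-ins I'm making up for illustration, not a claim about how any real system works:

```python
def predict(world_model, query, relevance, threshold=0.9):
    """Recursively zoom in on whichever sub-part of the (nested-dict)
    world model looks especially relevant to the query.

    relevance(name, query) is a made-up scoring function in [0, 1]."""
    if not isinstance(world_model, dict):
        return world_model                         # leaf: coarse answer
    best = max(world_model, key=lambda name: relevance(name, query))
    if relevance(best, query) > threshold:
        # Zoom in: model this part in higher resolution.
        return predict(world_model[best], query, relevance, threshold)
    return "low-resolution guess about " + query   # nothing stands out
```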

Now sup­pose the Pre­dict-O-Matic is try­ing to make a pre­dic­tion, and its “re­cur­sive un­cer­tainty de­com­po­si­tion” al­gorithms say the next pre­dic­tion made by this Pre­dict-O-Matic thing which hap­pens to oc­cupy its world model ap­pears es­pe­cially rele­vant! What then?

At this point, the Pre­dict-O-Matic has stepped into a hall of mir­rors. To pre­dict the next pre­dic­tion made by the Pre­dict-O-Matic in its world model, the Pre­dict-O-Matic needs to run an in­ter­nal simu­la­tion of that Pre­dict-O-Matic. But as it runs that simu­la­tion, it finds that simu­la­tion kick­ing off an­other Pre­dict-O-Matic simu­la­tion in the simu­lated Pre­dict-O-Matic’s world model! Etc, etc.

So if the Predict-O-Matic is implemented naively, the result could just be an infinite recursion. Not useful, but not necessarily dangerous either.

Let's suppose the Predict-O-Matic has a non-naive implementation and something prevents this infinite recursion. For example, there's a monitor process that notices when a model is eating up a lot of computation without delivering useful results, and replaces that model with one which is lower-resolution. Or maybe the Predict-O-Matic does have a naive implementation, but it doesn't have enough data about itself to model itself in much detail, so it ends up using a low-resolution model.
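Continuing the hypothetical sketch from above, the monitor could be as crude as a recursion budget that forces a fall-back to a coarse answer once it's exhausted:

```python
def guarded_predict(world_model, query, relevance, budget=10, threshold=0.9):
    if budget <= 0:
        # Monitor kicked in: stop zooming and answer at current resolution.
        return "coarse answer about " + query
    if not isinstance(world_model, dict):
        return world_model
    best = max(world_model, key=lambda name: relevance(name, query))
    if relevance(best, query) > threshold:
        # A self-model that keeps spawning self-models burns the budget
        # quickly and gets answered coarsely instead of recursing forever.
        return guarded_predict(world_model[best], query, relevance,
                               budget - 1, threshold)
    return "coarse answer about " + query
```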

One pos­si­bil­ity is that it’s able to find a use­ful out­side view model such as “the Pre­dict-O-Matic has a his­tory of mak­ing nega­tive self-fulfilling prophe­cies”. This could lead to the Pre­dict-O-Matic mak­ing a nega­tive prophecy (“the Pre­dict-O-Matic will con­tinue to make nega­tive prophe­cies which re­sult in ter­rible out­comes”), but this prophecy wouldn’t be se­lected for be­ing self-fulfilling. And we might use­fully ask the Pre­dict-O-Matic whether the ter­rible self-fulfilling prophe­cies will con­tinue con­di­tional on us tak­ing Ac­tion A.

# An­swer­ing a Ques­tion by Hav­ing the Answer

If you aren’t already con­vinced, here’s an­other ex­pla­na­tion for why I don’t think the Pre­dict-O-Matic will make self-fulfilling prophe­cies by de­fault.

In Abram’s story, the en­g­ineer says: “The an­swer to a ques­tion isn’t re­ally sep­a­rate from the ex­pected ob­ser­va­tion. So ‘prob­a­bil­ity of ob­ser­va­tion de­pend­ing on that pre­dic­tion’ would trans­late to ‘prob­a­bil­ity of an event given that event’, which just has to be one.”

In other words, if the Pre­dict-O-Matic knows it will pre­dict P = A, it as­signs prob­a­bil­ity 1 to the propo­si­tion that it will pre­dict P = A.

I contend that the Predict-O-Matic doesn't know it will predict P = A at the relevant time. That would require time travel: to know whether it will predict P = A, it would have to have made a prediction already, but it's still formulating its prediction as it thinks about what it will predict.

More details: Let's taboo "Predict-O-Matic" and instead talk about a "predictive model" and "input data". The trick is to avoid including the output of the predictive model in the model's input data. Including it the first time we make a prediction isn't even possible, because that would require time travel; the thing to avoid, as a practical matter, is re-running the model a second time with the prediction from its first run included in the input data. Let's say the dataset is kept completely static during prediction. (I offer no guarantees in the case where observational data about the model's prediction process is being used to inform the model while it makes a prediction!)
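As a toy illustration of that discipline (the `model`, its `.predict` method, and `features` below are placeholders, not anything from the story), the default usage makes one pass over static inputs, whereas the thing we're *not* doing by default would feed the output back in and iterate:

```python
def dualist_predict(model, features):
    # One pass over static input data: the model's own output never
    # appears anywhere in its input.
    return model.predict(features)

def fixed_point_predict(model, features, iterations=10):
    # The thing we are NOT doing by default: append the previous
    # prediction to the input and re-run, which amounts to searching
    # for a self-consistent (potentially self-fulfilling) prediction.
    prediction = None
    for _ in range(iterations):
        prediction = model.predict(features + [prediction])
    return prediction
```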

To clarify further, let's consider a non-Predict-O-Matic scenario where issues do crop up. Suppose I'm a big shot stock analyst. I think Acme Corp's stock is overvalued and will continue to be overvalued in one month's time. But before announcing my prediction, I do a sanity check. I notice that if I announce my opinion, that could cause investors to dump Acme, and Acme will likely no longer be overvalued in one month's time. So I veto the column I was planning to write on Acme, and instead search for a column c I can write such that c is a fixed point for the world w, i.e. w(c) = c: even when the world is given my column as an input, what I predicted in my column still comes true.
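A minimal sketch of that extra sanity-check step, with `world_response` standing in for my (hypothetical) model of how investors react to a published column:

```python
def find_fixed_point_column(candidate_columns, world_response):
    # Extra sanity-check step: look for a column c that still comes true
    # after the world reacts to it, i.e. w(c) = c.
    for c in candidate_columns:
        if world_response(c) == c:
            return c
    return None   # no self-consistent column found

# A naive predictor skips this loop entirely: it just reports its best
# current guess about Acme, without checking what publishing it would do.
```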

Note again that the san­ity check which leads to a search for a fixed point doesn’t hap­pen by de­fault—it re­quires some ex­tra func­tion­al­ity, be­yond what’s re­quired for naive pre­dic­tion, to im­ple­ment. The Pre­dict-O-Matic doesn’t care about look­ing bad, and there’s noth­ing con­tra­dic­tory about it pre­dict­ing that it won’t make the very pre­dic­tion it makes, or some­thing like that. Pre­dic­tive model, meet in­put data. That’s what it does.

# Open Questions

This is a sec­tion for half-baked thoughts that could grow into a coun­ter­ar­gu­ment for what I wrote above.

• What's going on when you try to model yourself thinking about the answer to this question? (Why is this question so hard to think about? Maybe my brain has a mechanism to prevent infinite recursion? Tangent: I wonder if this is evidence of a mechanism that tries to prevent my brain from making premature predictions by observing data about my own predictive process while trying to make predictions? Otherwise maybe there could be a cascade of updates where noticing that I've become more sure of something makes me even more sure of it, etc.) Anyway, I think it's important to understand whether my brain does something "in practice" which differs from what I've outlined here, some kind of method for collapsing recursion that a sufficiently advanced Predict-O-Matic might use.

• What if the Pre­dict-O-Matic as­signs some cre­dence to the idea that it’s “agen­tic” in na­ture? Then what if the Pre­dict-O-Matic as­signs some cre­dence to the idea that the simu­lated ver­sion of it­self could as­sign cre­dence to the idea that it’s in a simu­la­tion? (I think this is just a clas­sic dae­mon but maybe it differs in im­por­tant ways?)

• In ML, the predictive model isn't trying to maximize its own accuracy—that's what the training algorithm tries to do. The predictive model doesn't seem like an optimizer even in the "mathematical optimization" sense of the word optimization (is "mesa-optimizer" an appropriate term? In this case, I think we're glad it's a mesa-optimizer?). What if the Predict-O-Matic sometimes runs a training algorithm to update its model? How does that change things?

• What if time travel ac­tu­ally is pos­si­ble and we just haven’t dis­cov­ered it yet?

• Does any­thing in­ter­est­ing hap­pen if the Pre­dict-O-Matic be­comes aware of the con­cept of self-aware­ness?

# Prize Details

Again, \$100 prize for the first comment which crisply explains something that could go wrong with the dualist Predict-O-Matic. Contest ends when I publish my follow-up—probably next Wednesday the 23rd. I do have at least one answer in mind but I'm hoping you'll come up with something I haven't thought of. However, if I'm not convinced your thing would be a problem, I can't promise you a prize. No comment on whether the "Open Questions" are related to the answer I have in mind.

• I’m not re­ally sure what you mean when you say “some­thing goes wrong” (in re­la­tion to the prize). I’ve been think­ing about all this in a very de­scrip­tive way, ie, I want to un­der­stand what hap­pens gen­er­ally, not force a par­tic­u­lar out­come. So I’m a lit­tle out-of-touch with the “goes wrong” fram­ing at the mo­ment. There are a lot of differ­ent things which could hap­pen. Which con­sti­tute “go­ing wrong”?

• Be­com­ing non-my­opic; ie, us­ing strate­gies which get lower pre­dic­tion loss long-term rather than on a per-ques­tion ba­sis.

• (Note this doesn’t nec­es­sar­ily mean plan­ning to do so, in an in­ner-op­ti­mizer way.)

• Mak­ing self-fulfilling prophe­cies in or­der to strate­gi­cally min­i­mize pre­dic­tion loss on in­di­vi­d­ual ques­tions (while pos­si­bly re­main­ing my­opic).

• Hav­ing a ten­dency for self-fulfilling prophe­cies at all (not nec­es­sar­ily strate­gi­cally min­i­miz­ing loss).

• Hav­ing a ten­dency for self-fulfilling prophe­cies, but not nec­es­sar­ily the ones which so­ciety has cur­rently con­verged to (eg, dis­rupt­ing ex­ist­ing equil­ibria about money be­ing valuable be­cause ev­ery­one ex­pects things to stay that way).

• Strate­gi­cally min­i­miz­ing pre­dic­tion loss in any way other than by giv­ing bet­ter an­swers in an in­tu­itive sense.

• Ma­nipu­lat­ing the world strate­gi­cally in any way, to­ward any end.

• Catas­trophic risk by any means (not nec­es­sar­ily due to strate­gic ma­nipu­la­tion).

In par­tic­u­lar, in­ner mis­al­ign­ment seems like some­thing you aren’t in­clud­ing in your “go­ing wrong”? (Since it seems like an easy an­swer to your challenge.)

I note that the re­cur­sive-de­com­po­si­tion type sys­tem you de­scribe is very differ­ent from most mod­ern ML, and differ­ent from the “ba­si­cally gra­di­ent de­scent” sort of thing I was imag­in­ing in the story. (We might nat­u­rally sup­pose that Pre­dict-O-Matic has some “se­cret sauce” though.)

If you aren’t already con­vinced, here’s an­other ex­pla­na­tion for why I don’t think the Pre­dict-O-Matic will make self-fulfilling prophe­cies by de­fault.
In Abram’s story, the en­g­ineer says: “The an­swer to a ques­tion isn’t re­ally sep­a­rate from the ex­pected ob­ser­va­tion. So ‘prob­a­bil­ity of ob­ser­va­tion de­pend­ing on that pre­dic­tion’ would trans­late to ‘prob­a­bil­ity of an event given that event’, which just has to be one.”
In other words, if the Pre­dict-O-Matic knows it will pre­dict P = A, it as­signs prob­a­bil­ity 1 to the propo­si­tion that it will pre­dict P = A.

Right, ba­si­cally by defi­ni­tion. The word ‘given’ was in­tended in the Bayesian sense, ie, con­di­tional prob­a­bil­ity.

I contend that the Predict-O-Matic doesn't know it will predict P = A at the relevant time. That would require time travel: to know whether it will predict P = A, it would have to have made a prediction already, but it's still formulating its prediction as it thinks about what it will predict.

It’s quite pos­si­ble that the Pre­dict-O-Matic has be­come rel­a­tively pre­dictable-by-it­self, so that it gen­er­ally has good (not perfect) guesses about what it is about to pre­dict. I don’t mean that it is in an equil­ibrium with it­self; its pre­dic­tions may be shift­ing in pre­dictable di­rec­tions. If these shifts be­come large enough, or if its pre­dictabil­ity goes sec­ond-or­der (it pre­dicts that it’ll pre­dict its own out­put, and thus pre-an­ti­ci­pates the di­rec­tion of shift re­cur­sively) it has to stop know­ing its own out­put in so much de­tail (it’s chang­ing too fast to learn about). But it can pos­si­bly know a lot about its out­put.

I definitely agree with most of the stuff in the ‘an­swer­ing a ques­tion by hav­ing the an­swer’ sec­tion. Whether a sys­tem ex­plic­itly makes the pre­dic­tion into a fixed point is a crit­i­cal ques­tion, which will de­ter­mine which way some of these is­sues go.

• If the sys­tem does, then there are ex­plicit ‘han­dles’ to op­ti­mize the world by se­lect­ing which self-fulfilling prophe­cies to make true. We are effec­tively forced to deal with the is­sue (if only by ran­dom se­lec­tion).

• If the sys­tem doesn’t, then we lack such han­dles, but the sys­tem still has to do some­thing in the face of such situ­a­tions. It may con­verge to self-fulfilling stuff. It may not, and so, pro­duce ‘in­con­sis­tent’ out­puts for­ever. This will de­pend on fea­tures of the learn­ing al­gorithm as well as fea­tures of the situ­a­tion it finds it­self in.

It seems a bit like you might be equat­ing the sec­ond op­tion with “does not pro­duce self-fulfilling prophe­cies”, which I think would be a mis­take.

• In­tu­itively, things go wrong if you get un­ex­pected, un­wanted, po­ten­tially catas­trophic be­hav­ior. Ba­si­cally, if it’s some­thing we’d want to fix be­fore us­ing this thing in pro­duc­tion. I think most of your bul­let points qual­ify, but if you give an ex­am­ple which falls un­der one of those bul­let points, yet doesn’t seem like it’d be much of a con­cern in prac­tice (very lit­tle catas­trophic po­ten­tial), that might not get a prize.

In par­tic­u­lar, in­ner mis­al­ign­ment seems like some­thing you aren’t in­clud­ing in your “go­ing wrong”? (Since it seems like an easy an­swer to your challenge.)

Thanks for bring­ing that up. Yes, I am look­ing speci­fi­cally for defeaters aimed in the gen­eral di­rec­tion of the points I made in this post. Bring­ing up generic widely known safety con­cerns that many de­signs are po­ten­tially sus­cep­ti­ble to does not qual­ify.

I note that the re­cur­sive-de­com­po­si­tion type sys­tem you de­scribe is very differ­ent from most mod­ern ML, and differ­ent from the “ba­si­cally gra­di­ent de­scent” sort of thing I was imag­in­ing in the story. (We might nat­u­rally sup­pose that Pre­dict-O-Matic has some “se­cret sauce” though.)

I think there’s po­ten­tially an anal­ogy with at­ten­tion in the con­text of deep learn­ing, but it’s pretty loose.

It seems a bit like you might be equat­ing the sec­ond op­tion with “does not pro­duce self-fulfilling prophe­cies”, which I think would be a mis­take.

Do you mean to say that a prophecy might hap­pen to be self-fulfilling even if it wasn’t op­ti­mized for be­ing so? Or are you try­ing to dis­t­in­guish be­tween “ex­plicit” and “im­plicit” searches for fixed points? Or are you try­ing to dis­t­in­guish be­tween fixed points and self-fulfilling prophe­cies some­how? (I thought they were ba­si­cally the same thing.)

• Do you mean to say that a prophecy might hap­pen to be self-fulfilling even if it wasn’t op­ti­mized for be­ing so? Or are you try­ing to dis­t­in­guish be­tween “ex­plicit” and “im­plicit” searches for fixed points?

More the sec­ond than the first, but I’m also say­ing that the line be­tween the two is blurry.

For ex­am­ple, sup­pose there is some­one who will of­ten do what pre­dict-o-matic pre­dicts if they can un­der­stand how to do it. They of­ten ask it what they are go­ing to do. At first, pre­dict-o-matic pre­dicts them as usual. This mod­ifies their be­hav­ior to be some­what more pre­dictable than it nor­mally would be. Pre­dict-o-matic locks into the pat­terns (es­pe­cially the pre­dic­tions which work the best as sug­ges­tions). Be­hav­ior gets even more reg­u­lar. And so on.

You could say that no one is op­ti­miz­ing for fixed-point-ness here, and pre­dict-o-matic is just chanc­ing into it. But effec­tively, there’s an op­ti­miza­tion im­ple­mented by the pair of the pre­dict-o-matic and the per­son.

In situ­a­tions like that, you get into an op­ti­mized fixed point over time, even though the learn­ing al­gorithm it­self isn’t ex­plic­itly search­ing for that.
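Here's a toy simulation of that lock-in dynamic. The gym example, the 0.6 compliance rate, and every other number are invented, and the "predictor" is nothing but an empirical frequency:

```python
import random

history = [random.random() < 0.5 for _ in range(10)]   # past "gym days"

for day in range(200):
    predicted_gym = sum(history) / len(history) > 0.5  # predictor's call
    if random.random() < 0.6:       # the person follows the prediction...
        went = predicted_gym
    else:                           # ...otherwise falls back on a coin flip
        went = random.random() < 0.5
    history.append(went)

# Frequency over the last 50 days: settles around 0.8 or around 0.2,
# i.e. noticeably more regular than the 0.5 baseline, with prediction
# and behavior reinforcing each other.
print(sum(history[-50:]) / 50)
```

Nothing in this loop searches for fixed points, but the pair of predictor and person still drifts toward one.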

• To high­light the “blurry dis­tinc­tion” more:

In situ­a­tions like that, you get into an op­ti­mized fixed point over time, even though the learn­ing al­gorithm it­self isn’t ex­plic­itly search­ing for that.

Note, if the pre­dic­tion al­gorithm an­ti­ci­pates this pro­cess (per­haps par­tially), it will “jump ahead”, so that con­ver­gence to a fixed point hap­pens more within the com­pu­ta­tion of the pre­dic­tor (less over steps of real world in­ter­ac­tion). This isn’t for­mally the same as search­ing for fixed points in­ter­nally (you will get much weaker guaran­tees out of this hap­haz­ard pro­cess), but it does mean op­ti­miza­tion for fixed point find­ing is hap­pen­ing within the sys­tem un­der some con­di­tions.

• What’s go­ing on when you try to model your­self think­ing about the an­swer to this ques­tion?

If a system is analyzing (itself analyzing (itself analyzing (...))), not realizing it's doing so, I suspect that it will come up with some best guess answer, but that answer will be ill-determined and dependent on implementation details. Thus a better approach would be to avoid asking self-unaware systems any question that requires that type of analysis!

For ex­am­ple, you can ask “Please out­put the least im­prob­a­ble sce­nario, ac­cord­ing to your pre­dic­tive world-model, wherein a cure for Alzheimer’s is in­vented by a group of peo­ple with no ac­cess to any AI or­a­cles!” Or even ask it to do coun­ter­fac­tual rea­son­ing about what might hap­pen in a world in which there are no AIs what­so­ever. (copied from my post here). This type of ques­tion is nice for other rea­sons too—we’re ask­ing the sys­tem to guess what nor­mal hu­mans might plau­si­bly do in the nat­u­ral course of events, and thus we’ll more typ­i­cally get nor­mal-hu­man-type solu­tions to our prob­lems as op­posed to bizarre alien hu­man-un­friendly solu­tions.

• If du­al­ism holds for Abram’s pre­dic­tion AI, the “Pre­dict-O-Matic”, its world model may hap­pen to in­clude this thing called the Pre­dict-O-Matic which seems to make ac­cu­rate pre­dic­tions—but it’s not spe­cial in any way and isn’t be­ing mod­eled any differ­ently than any­thing else in the world. Again, I think this is a pretty rea­son­able guess for the Pre­dict-O-Matic’s de­fault be­hav­ior. I sus­pect other be­hav­ior would re­quire spe­cial code which at­tempts to pin­point the Pre­dict-O-Matic in its own world model and give it spe­cial treat­ment (an “ego”).

I don’t think this is right. In par­tic­u­lar, I think we should ex­pect ML to be bi­ased to­wards sim­ple func­tions such that if there’s a sim­ple and ob­vi­ous com­pres­sion, then you should ex­pect ML to take it. In par­tic­u­lar, hav­ing an “ego” which iden­ti­fies it­self with its model of it­self sig­nifi­cantly re­duces de­scrip­tion length by not hav­ing to du­pli­cate a bunch of in­for­ma­tion about its own de­ci­sion-mak­ing pro­cess.

• In par­tic­u­lar, I think we should ex­pect ML to be bi­ased to­wards sim­ple func­tions such that if there’s a sim­ple and ob­vi­ous com­pres­sion, then you should ex­pect ML to take it.

Yes, for the most part.

In par­tic­u­lar, hav­ing an “ego” which iden­ti­fies it­self with its model of it­self sig­nifi­cantly re­duces de­scrip­tion length by not hav­ing to du­pli­cate a bunch of in­for­ma­tion about its own de­ci­sion-mak­ing pro­cess.

I think maybe you’re pre-sup­pos­ing what you’re try­ing to show. Most of the time, when I train a ma­chine learn­ing model on some data, that data isn’t data about the ML train­ing al­gorithm or model it­self. This info is usu­ally not part of the dataset whose de­scrip­tion length the sys­tem is at­tempt­ing to min­i­mize.

A ma­chine learn­ing model doesn’t get un­der­stand­ing of or data about its code “for free”, in the same way we don’t get knowl­edge of how brains work “for free” de­spite the fact that we are brains. Hu­mans get self-knowl­edge in ba­si­cally the same way we get any other kind of knowl­edge—by mak­ing ob­ser­va­tions. We aren’t ex­pert neu­ro­scien­tists from birth. Part of what I’m try­ing to in­di­cate with the “du­al­ist” term is that this Pre­dict-O-Matic is the same way, i.e. its po­si­tion with re­spect to it­self is similar to the po­si­tion of an as­piring neu­ro­scien­tist with re­spect to their own brain.

• Most of the time, when I train a ma­chine learn­ing model on some data, that data isn’t data about the ML train­ing al­gorithm or model it­self.

If the data isn’t at all about the ML train­ing al­gorithm, then why would it even build a model of it­self in the first place, re­gard­less of whether it was du­al­ist or not?

A ma­chine learn­ing model doesn’t get un­der­stand­ing of or data about its code “for free”, in the same way we don’t get knowl­edge of how brains work “for free” de­spite the fact that we are brains.

We might not have good mod­els of brains, but we do have very good mod­els of our­selves, which is the ac­tual anal­ogy here. You don’t have to have a good model of your brain to have a good model of your­self, and to iden­tify that model of your­self with your own ac­tions (i.e. the thing you called an “ego”).

Part of what I’m try­ing to in­di­cate with the “du­al­ist” term is that this Pre­dict-O-Matic is the same way, i.e. its po­si­tion with re­spect to it­self is similar to the po­si­tion of an as­piring neu­ro­scien­tist with re­spect to their own brain.

Also, if you think that, then I’m con­fused why you think this is a good safety prop­erty; hu­man neu­ro­scien­tists are pre­cisely the sort of highly agen­tic mis­al­igned mesa-op­ti­miz­ers that you pre­sum­ably want to avoid when you just want to build a good pre­dic­tion ma­chine.

--

I think I didn’t fully con­vey my pic­ture here, so let me try to ex­plain how I think this could hap­pen. Sup­pose you’re train­ing a pre­dic­tor and the data in­cludes enough in­for­ma­tion about it­self that it has to form some model of it­self. Once that’s hap­pened—or while it’s in the pro­cess of hap­pen­ing—there is a mas­sive du­pli­ca­tion of in­for­ma­tion be­tween the part of the model that en­codes its pre­dic­tion ma­chin­ery and the part that en­codes its model of it­self. A much sim­pler model would be one that just uses the same ma­chin­ery for both, and since ML is bi­ased to­wards sim­ple mod­els, you should ex­pect it to be shared—which is pre­cisely the thing you were call­ing an “ego.”

• When you wrote

hav­ing an “ego” which iden­ti­fies it­self with its model of it­self sig­nifi­cantly re­duces de­scrip­tion length by not hav­ing to du­pli­cate a bunch of in­for­ma­tion about its own de­ci­sion-mak­ing pro­cess.

that sug­gested to me that there were 2 in­stances of this info about Pre­dict-O-Matic’s de­ci­sion-mak­ing pro­cess in the dataset whose de­scrip­tion length we’re try­ing to min­i­mize. “De-du­pli­ca­tion” only makes sense if there’s more than one. Why is there more than one?

We might not have good mod­els of brains, but we do have very good mod­els of our­selves, which is the ac­tual anal­ogy here. You don’t have to have a good model of your brain to have a good model of your­self, and to iden­tify that model of your­self with your own ac­tions (i.e. the thing you called an “ego”).

Some­times peo­ple take psychedelic drugs/​med­i­tate and re­port an out of body ex­pe­rience, one­ness with the uni­verse, ego dis­solu­tion, etc. This sug­gests to me that ego is an evolved adap­ta­tion rather than a ne­ces­sity for cog­ni­tion. A clue is the fact that our ego ex­tends to all parts of our body, even those which aren’t nec­es­sary for com­pu­ta­tion (but are nec­es­sary for sur­vival & re­pro­duc­tion)

there is a mas­sive du­pli­ca­tion of in­for­ma­tion be­tween the part of the model that en­codes its pre­dic­tion ma­chin­ery and the part that en­codes its model of it­self.

The pre­dic­tion ma­chin­ery is in code, but this code isn’t part of the info whose de­scrip­tion length is at­tempt­ing to be min­i­mized, un­less we take spe­cial ac­tion to in­clude it in that info. That’s the point I was try­ing to make pre­vi­ously.

Com­pres­sion has im­por­tant similar­i­ties to pre­dic­tion. In com­pres­sion terms, your ar­gu­ment is es­sen­tially that if we use zip to com­press its own source code, it will be able to com­press its own source code us­ing a very small num­ber of bytes, be­cause it “already knows about it­self”.

• that sug­gested to me that there were 2 in­stances of this info about Pre­dict-O-Matic’s de­ci­sion-mak­ing pro­cess in the dataset whose de­scrip­tion length we’re try­ing to min­i­mize. “De-du­pli­ca­tion” only makes sense if there’s more than one. Why is there more than one?

ML doesn’t min­i­mize the de­scrip­tion length of the dataset—I’m not even sure what that might mean—rather, it min­i­mizes the de­scrip­tion length of the model. And the model does con­tain two copies of in­for­ma­tion about Pre­dict-O-Matic’s de­ci­sion-mak­ing pro­cess—one in its pre­dic­tion pro­cess and one in its world model.

The pre­dic­tion ma­chin­ery is in code, but this code isn’t part of the info whose de­scrip­tion length is at­tempt­ing to be min­i­mized, un­less we take spe­cial ac­tion to in­clude it in that info. That’s the point I was try­ing to make pre­vi­ously.

Modern pre­dic­tive mod­els don’t have some sep­a­rate hard-coded piece that does pre­dic­tion—in­stead you just train ev­ery­thing. If you con­sider GPT-2, for ex­am­ple, it’s just a bunch of trans­form­ers hooked to­gether. The only in­for­ma­tion that isn’t in­cluded in the de­scrip­tion length of the model is what trans­form­ers are, but “what’s a trans­former” is quite differ­ent than “how do I make pre­dic­tions.” All of the in­for­ma­tion about how the model ac­tu­ally makes its pre­dic­tions in that sort of a setup is go­ing to be trained.

• I think maybe what you’re get­ting at is that if we try to get a ma­chine learn­ing model to pre­dict its own pre­dic­tions (i.e. we give it a bunch of data which con­sists of la­bels that it made it­self), it will do this very eas­ily. Agreed. But that doesn’t im­ply it’s aware of “it­self” as an en­tity. And in some cases the rele­vant as­pect of its in­ter­nals might not be available as a con­cep­tual build­ing block. For ex­am­ple, a model trained us­ing stochas­tic gra­di­ent de­scent is not nec­es­sar­ily bet­ter at un­der­stand­ing or pre­dict­ing a pro­cess which is very similar to stochas­tic gra­di­ent de­scent.

Fur­ther­more, sup­pose that we take the weights for a par­tic­u­lar model, mask some of those weights out, use them as the la­bels y, and try to pre­dict them us­ing the other weights in that layer as fea­tures x. The model will perform ter­ribly on this be­cause it’s not the task that it was trained for. It doesn’t mag­i­cally have the “self-aware­ness” nec­es­sary to see what’s go­ing on.
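For concreteness, the probe described above might be set up roughly like this (a hypothetical sketch; the "layer" here is just random stand-in data, not any particular trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
layer_weights = rng.normal(size=(64, 32))    # stand-in for a trained layer

features, labels = [], []
for row in layer_weights:
    mask = rng.random(row.shape) < 0.25      # hide ~25% of each row
    labels.append(row[mask])                 # hidden weights become y
    features.append(row[~mask])              # visible weights become x

# Nothing in an ordinarily-trained predictor points its "self-knowledge"
# at its own parameters, so there's no reason to expect it to recover
# y from x here any better than chance.
```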

In or­der to be crisp about what could hap­pen, your ex­pla­na­tion also has to ac­count for what clearly won’t hap­pen.

BTW this thread also seems relevant: https://www.lesswrong.com/posts/RmPKdMqSr2xRwrqyE/the-dualist-predict-o-matic-usd100-prize#AvbnFiKpJxDqM8GYh

• I think maybe what you’re get­ting at is that if we try to get a ma­chine learn­ing model to pre­dict its own pre­dic­tions (i.e. we give it a bunch of data which con­sists of la­bels that it made it­self), it will do this very eas­ily. Agreed. But that doesn’t im­ply it’s aware of “it­self” as an en­tity.

No, but it does im­ply that it has the in­for­ma­tion about its own pre­dic­tion pro­cess en­coded in its weights such that there’s no rea­son it would have to en­code that in­for­ma­tion twice by also re-en­cod­ing it as part of its knowl­edge of the world as well.

Fur­ther­more, sup­pose that we take the weights for a par­tic­u­lar model, mask some of those weights out, use them as the la­bels y, and try to pre­dict them us­ing the other weights in that layer as fea­tures x. The model will perform ter­ribly on this be­cause it’s not the task that it was trained for. It doesn’t mag­i­cally have the “self-aware­ness” nec­es­sary to see what’s go­ing on.

Sure, but that’s not ac­tu­ally the rele­vant task here. It may not un­der­stand its own weights, but it does un­der­stand its own pre­dic­tive pro­cess, and thus its own out­put, such that there’s no rea­son it would en­code that in­for­ma­tion again in its world model.

• No, but it does im­ply that it has the in­for­ma­tion about its own pre­dic­tion pro­cess en­coded in its weights such that there’s no rea­son it would have to en­code that in­for­ma­tion twice by also re-en­cod­ing it as part of its knowl­edge of the world as well.

OK, it sounds like we agree then? Like, the Pre­dict-O-Matic might have an un­usu­ally easy time mod­el­ing it­self in cer­tain ways, but other than that, it doesn’t get spe­cial treat­ment be­cause it has no spe­cial aware­ness of it­self as an en­tity?

Edit: Trying to provide an intuition pump for what I mean here—in order to avoid duplicating information, I might assume that something which looks like a stapler behaves the same way as other things I've seen which look like staplers—but that doesn't mean I think all staplers are the same object. It might in some cases be sensible to notice that I keep seeing a stapler lying around and hypothesize that there's just one stapler which keeps getting moved around the office. But that requires that I perceive the stapler as an entity every time I see it, so entities which were previously separate in my head can be merged. Whereas arguendo, my prediction machinery isn't necessarily an entity that I recognize; it's more like the water I'm swimming in, in some sense.

• I don’t think we do agree, in that I think pres­sure to­wards sim­ple mod­els im­plies that they won’t be du­al­ist in the way that you’re claiming.

• If du­al­ism holds for Abram’s pre­dic­tion AI, the “Pre­dict-O-Matic”, its world model may hap­pen to in­clude this thing called the Pre­dict-O-Matic which seems to make ac­cu­rate pre­dic­tions—but it’s not spe­cial in any way and isn’t be­ing mod­eled any differ­ently than any­thing else in the world. Again, I think this is a pretty rea­son­able guess for the Pre­dict-O-Matic’s de­fault be­hav­ior. I sus­pect other be­hav­ior would re­quire spe­cial code which at­tempts to pin­point the Pre­dict-O-Matic in its own world model and give it spe­cial treat­ment (an “ego”).

I don’t see why we should ex­pect this. We’re told that the Pre­dict-O-Matic is be­ing trained with some­thing like sgd, and sgd doesn’t re­ally care about whether the model it’s im­ple­ment­ing is du­al­ist or non-du­al­ist; it just tries to find a model that gen­er­ates a lot of re­ward. In par­tic­u­lar, this seems wrong to me:

The Pre­dict-O-Matic doesn’t care about look­ing bad, and there’s noth­ing con­tra­dic­tory about it pre­dict­ing that it won’t make the very pre­dic­tion it makes, or some­thing like that.

If the Predict-O-Matic has a model that makes bad predictions (i.e. looks bad), that model will be selected against. And if it accidentally stumbled upon a model that could correctly think about its own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true). So at least in the limit of search and exploration, we should expect sgd to end up with a model that finds fixed points, if we train it in a situation where its predictions affect the future.

If we only train it on data where it can’t af­fect the data that it’s eval­u­ated against, and then freeze the model, I agree that it prob­a­bly won’t ex­hibit this kind of be­havi­our; is that the sce­nario that you’re think­ing about?

• it just tries to find a model that gen­er­ates a lot of reward

SGD searches for a set of pa­ram­e­ters which min­i­mize a loss func­tion. Selec­tion, not con­trol.

If the Predict-O-Matic has a model that makes bad predictions (i.e. looks bad), that model will be selected against.

Only if that info is in­cluded in the dataset that SGD is try­ing to min­i­mize a loss func­tion with re­spect to.

And if it accidentally stumbled upon a model that could correctly think about its own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true).

Sup­pose we’re run­ning SGD try­ing to find a model which min­i­mizes the loss over a set of (situ­a­tion, out­come) pairs. Sup­pose some of the situ­a­tions are situ­a­tions in which the Pre­dict-O-Matic made a pre­dic­tion, and that pre­dic­tion turned out to be false. It’s con­ceiv­able that SGD could learn that the Pre­dict-O-Matic pre­dict­ing some­thing makes it less likely to hap­pen and use that as a fea­ture. How­ever, this wouldn’t be helpful be­cause the Pre­dict-O-Matic doesn’t know what pre­dic­tion it will make at test time. At best it could in­fer that some of its older pre­dic­tions will prob­a­bly end up be­ing false and use that fact to in­form the thing it’s cur­rently try­ing to pre­dict.

If we only train it on data where it can’t af­fect the data that it’s eval­u­ated against, and then freeze the model, I agree that it prob­a­bly won’t ex­hibit this kind of be­havi­our; is that the sce­nario that you’re think­ing about?

Not nec­es­sar­ily. The sce­nario I have in mind is the stan­dard ML sce­nario where SGD is just try­ing to find some pa­ram­e­ters which min­i­mize a loss func­tion which is sup­posed to ap­prox­i­mate the pre­dic­tive ac­cu­racy of those pa­ram­e­ters. Then we use those pa­ram­e­ters to make pre­dic­tions. SGD isn’t con­cerned with fu­ture hy­po­thet­i­cal rounds of SGD on fu­ture hy­po­thet­i­cal datasets. In some sense, it’s not even con­cerned with pre­dic­tive ac­cu­racy ex­cept in­so­far as train­ing data hap­pens to gen­er­al­ize to new data.

If you think in­clud­ing his­tor­i­cal ob­ser­va­tions of a Pre­dict-O-Matic (which hap­pens to be ‘one­self’) mak­ing bad (or good) pre­dic­tions in the Pre­dict-O-Matic’s train­ing dataset will cause a catas­tro­phe, that’s within the range of sce­nar­ios I care about, so please do ex­plain!

By the way, if any­one wants to un­der­stand the stan­dard ML sce­nario more deeply, I recom­mend this class.

• I think our dis­agree­ment comes from you imag­in­ing offline learn­ing, while I’m imag­in­ing on­line learn­ing. If we have a pre­defined set of (situ­a­tion, out­come) pairs, then the Pre­dict-O-Matic’s pre­dic­tions ob­vi­ously can’t af­fect the data that it’s eval­u­ated against (the out­come), so I agree that it’ll end up pretty du­al­is­tic. But if we put a Pre­dict-O-Matic in the real world, let it gen­er­ate pre­dic­tions, and then define the loss ac­cord­ing to what hap­pens af­ter­wards, a non-du­al­is­tic Pre­dict-O-Matic will be se­lected for over du­al­is­tic var­i­ants.

If you still dis­agree with that, what do you think would hap­pen (in the limit of in­finite train­ing time) with an al­gorithm that just made a ran­dom change pro­por­tional to how wrong it was, at ev­ery train­ing step? Think­ing about SGD is a bit com­pli­cated, since it calcu­lates the gra­di­ent while as­sum­ing that the data stays con­stant, but if we use on­line train­ing on an al­gorithm that just tries things un­til some­thing works, I’m pretty con­fi­dent that it’d end up look­ing for fixed points.
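One way to make that thought experiment concrete is the toy simulation below. The environment dynamics, step sizes, and round count are all invented; the learner takes a randomly-directed step whose size scales with how wrong its last prediction was, in an environment where announcing a high probability makes the event more likely:

```python
import random

def outcome(prediction):
    # Invented self-fulfilling environment: announcing a higher probability
    # makes the event more likely.  Its only fixed point is p = 1.
    return 1.0 if random.random() < 0.2 + 0.8 * prediction else 0.0

p = 0.5  # current announced probability of the event
for step in range(100_000):
    error = abs(outcome(p) - p)
    # Random direction, magnitude proportional to how wrong it just was
    # (deliberately not a gradient step).
    p += random.uniform(-1.0, 1.0) * 0.05 * error
    p = min(max(p, 0.0), 1.0)

# Typically ends up at or very close to 1.0, the environment's only fixed
# point: everywhere else it keeps getting jostled, and only there does it
# stop being wrong.
print(p)
```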

• But if we put a Pre­dict-O-Matic in the real world, let it gen­er­ate pre­dic­tions, and then define the loss ac­cord­ing to what hap­pens af­ter­wards, a non-du­al­is­tic Pre­dict-O-Matic will be se­lected for over du­al­is­tic var­i­ants.

Yes, that sounds more like re­in­force­ment learn­ing. It is not the de­sign I’m try­ing to point at in this post.

If you still dis­agree with that, what do you think would hap­pen (in the limit of in­finite train­ing time) with an al­gorithm that just made a ran­dom change pro­por­tional to how wrong it was, at ev­ery train­ing step?

That de­scrip­tion sounds a lot like SGD. I think you’ll need to be crisper for me to see what you’re get­ting at.

• Yes, that sounds more like re­in­force­ment learn­ing. It is not the de­sign I’m try­ing to point at in this post.

• Ok, cool, that explains it. I guess the main difference between RL and online supervised learning is whether the model takes actions that can affect its environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

That de­scrip­tion sounds a lot like SGD. I think you’ll need to be crisper for me to see what you’re get­ting at.

No need, since we already found the point of dis­agree­ment. (But if you’re cu­ri­ous, the differ­ence is that sgd makes a change in the di­rec­tion of the gra­di­ent, and this one wouldn’t.)

• it seems plau­si­ble that some­one train­ing the Pre­dict-O-Matic like that would think they’re do­ing su­per­vised learn­ing, while they’re ac­tu­ally closer to RL.

How’s that?

• Assuming that people don't think about the fact that Predict-O-Matic's predictions can affect reality (which seems like it might have been true early on in the story, although it's admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it's evaluated against, so there might be any number of "correct" answers (rather than exactly 1). Although it's a blurry line, I'd say this makes its output more action-like and less prediction-like, so you could say that it makes the training process a bit more RL-like.

• I think it de­pends on in­ter­nal de­tails of the Pre­dict-O-Matic’s pre­dic­tion pro­cess. If it’s still us­ing SGD, SGD is not go­ing to play the fu­ture for­ward to see the new feed­back mechanism you’ve de­scribed and in­cor­po­rate it into the loss func­tion which is be­ing min­i­mized. How­ever, it’s con­ceiv­able that given a dataset about its own past pre­dic­tions and how they turned out, the Pre­dict-O-Matic might learn to make its pre­dic­tions “more self-fulfilling” in or­der to min­i­mize loss on that dataset?

• SGD is not go­ing to play the fu­ture for­ward to see the new feed­back mechanism you’ve de­scribed and in­cor­po­rate it into the loss func­tion which is be­ing minimized

My ‘new feed­back mechanism’ is part of the train­ing pro­ce­dure. It’s not go­ing to be good at that by ‘play­ing the fu­ture for­ward’, it’s go­ing to be­come good at that by be­ing trained on it.

I sus­pect we’re us­ing SGD in differ­ent ways, be­cause ev­ery­thing we’ve talked about seems like it could be im­ple­mented with SGD. Do you agree that let­ting the Pre­dict-O-Matic pre­dict the fu­ture and re­ward­ing it for be­ing right, RL-style, would lead to it find­ing fixed points? Be­cause you can definitely use SGD to do RL (first google re­sult).

• I sus­pect we’re us­ing SGD in differ­ent ways, be­cause ev­ery­thing we’ve talked about seems like it could be im­ple­mented with SGD. Do you agree that let­ting the Pre­dict-O-Matic pre­dict the fu­ture and re­ward­ing it for be­ing right, RL-style, would lead to it find­ing fixed points? Be­cause you can definitely use SGD to do RL (first google re­sult).

Fair enough, I was think­ing about su­per­vised learn­ing.

• Two re­marks.

Re­mark 1: Here’s a sim­ple model of self-fulfilling prophe­cies.

First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum likelihood outcome, (ii) produce the entire distribution over outcomes, or (iii) sample an outcome from the distribution. But, since Predict-O-Matic is supposed to produce predictions for large-volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that is maximum likelihood but is extremely untypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it's infeasible. More sophisticated variants are possible, but I don't think any of them avoids the problem.

If the Pre­dict-O-Matic is a Bayesian in­fer­ence al­gorithm, an in­ter­est­ing dy­namic will re­sult. On each round, some hy­poth­e­sis will be sam­pled out of the cur­rent be­lief state. If this hy­poth­e­sis is a self-fulfilling prophecy, sam­pling it will cause its like­li­hood to go up. We get pos­i­tive feed­back: the higher the prob­a­bil­ity Pre­dict-O-Matic as­signs to the hy­poth­e­sis, the more of­ten it is sam­pled, the more ev­i­dence in fa­vor of the hy­poth­e­sis is pro­duced, the higher its prob­a­bil­ity be­comes. So, if it starts out as suffi­ciently prob­a­ble a pri­ori, the be­lief state will con­verge there.

Of course re­al­is­tic learn­ing al­gorithms are not Bayesian in­fer­ence, but they have to ap­prox­i­mate Bayesian in­fer­ence in some sense. At the least, there has to be some large space of hy­pothe­ses s.t. if one of them is true, the al­gorithm will con­verge there. Any al­gorithm with this prop­erty prob­a­bly dis­plays the dy­nam­ics above.

Now, to the simple model. In this model we have just two outcomes: A and B (so it's not large-volume data, but that doesn't matter). On each round a prediction is made, after which some outcome occurs. The true environment works as follows: if prediction "A" is made, on this round A happens with probability 99% and B with probability 1%. If prediction "B" is made, on this round B happens with probability 100%. Of course Predict-O-Matic is not aware that predictions can influence outcomes. Instead, we will assume Predict-O-Matic is doing Bayesian inference with a prior over hypotheses, each of which assumes that the environment is IID. In other words, it is learning a single parameter p, which is the probability that A will occur on any given round.

Claim: If the prior is s.t. any interval in p-space is assigned positive probability, then Predict-O-Matic will converge to predicting B with frequency 1.

Sketch of proof: Suppose Predict-O-Matic converges to predicting B with some frequency f. Then the environment converges to producing outcome B with frequency f + 0.01(1 − f), which is strictly greater than f whenever f < 1, implying that Predict-O-Matic (whose learned parameter tracks the empirical frequency) converges to predicting B with frequency greater than f. So the only consistent value is f = 1.
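A quick numerical check of this dynamic, using a Beta prior over the single parameter p = P(A). To make the drift visible in a short run, the 99%/1% figures from the comment are exaggerated here to 70%/30%; the qualitative behavior (slow convergence toward always predicting B) is the same:

```python
import random

alpha, beta_ = 1.0, 1.0          # Beta(1,1) prior over p = P(A)
recent_b_predictions = []

for n in range(100_000):
    p_a = alpha / (alpha + beta_)                        # posterior predictive P(A)
    prediction = "A" if random.random() < p_a else "B"   # option (iii): sample
    if prediction == "A":
        outcome = "A" if random.random() < 0.7 else "B"  # A usually happens
    else:
        outcome = "B"                                    # B always happens
    if outcome == "A":
        alpha += 1
    else:
        beta_ += 1
    if n >= 90_000:
        recent_b_predictions.append(prediction == "B")

# Fraction of "B" predictions over the last 10,000 rounds: close to 1,
# even though nothing in the learner "knows" its predictions matter.
print(sum(recent_b_predictions) / len(recent_b_predictions))
```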

Re­mark 2: Some of the hy­pothe­ses in the prior might be in­tel­li­gent agents in their own right, with their own util­ity func­tions. Such an agent can in­ten­tion­ally pro­duce cor­rect pre­dic­tions to in­crease its prob­a­bil­ity in the be­lief state, un­til a “treach­er­ous turn” point when it pro­duces a pre­dic­tion de­signed to have ir­re­versible con­se­quences in the out­side world in fa­vor of the agent. If it is not a self-fulfilling prophecy, this treach­er­ous pre­dic­tion will cause Pre­dict-O-Matic to up­date against the agen­tic hy­poth­e­sis, but it might be too late. If it is a self-fulfilling prophecy, it will only make this hy­poth­e­sis even stronger.

More­over, there is a mechanism that sys­tem­at­i­cally pro­duces such agen­tic hy­pothe­ses. Namely, a suffi­ciently pow­er­ful pre­dic­tor is likely to run into “simu­la­tion hy­pothe­ses” i.e. hy­pothe­ses that claim the uni­verse is a simu­la­tion by some other agent. As Chris­ti­ano ar­gued be­fore, that opens an at­tack vec­tor for pow­er­ful agents across the mul­ti­verse to ma­nipu­late Pre­dict-O-Matic into mak­ing what­ever pre­dic­tions they want (as­sum­ing Pre­dict-O-Matic is suffi­ciently pow­er­ful to guess what pre­dic­tions those agents would want it to make).

• I dis­agree that self-un­aware­ness /​ du­al­ism should be the de­fault as­sump­tion, for rea­sons I ex­plained in this com­ment. In fact I think that mak­ing a sys­tem that know­ably re­mains self-un­aware through ar­bi­trary in­creases in knowl­edge and ca­pa­bil­ity would be a gi­ant leap to­wards solv­ing AI al­ign­ment. I have vague spec­u­la­tive ideas for how that might be done with a type-check­ing proof, again see that com­ment I linked.

• So most an­i­mals don’t seem very in­tro­spec­tive. Ma­chine learn­ing al­gorithms haven’t shown spon­ta­neous ca­pac­ity for in­tro­spec­tion (so far, that I know of). But hu­mans can in­tro­spect. Maybe a crux here is some­thing along the lines of: Hu­mans have ca­pa­bil­ity for in­tro­spec­tion. They’re also smarter than an­i­mals. Maybe once our ML al­gorithms get good enough, the ca­pac­ity for in­tro­spec­tion will spon­ta­neously arise.

People should be thinking about this possibility. But we also have ML algorithms which are in some ways superhuman, and as I said, I know of no instances of the spontaneous emergence of introspection. It seems like a reasonably likely possibility to me that "intelligence" of the sort needed for cross-domain superhuman prediction ability and spontaneous emergence of introspection are, in fact, orthogonal axes.

In terms of non-spon­ta­neous emer­gence of in­tro­spec­tion, that’s ba­si­cally meta-learn­ing. I agree meta-learn­ing is prob­a­bly su­per im­por­tant for the fu­ture of AI. In fact, come to think of it, I won­der if the rea­son why hu­mans are both smart and in­tro­spec­tive is be­cause our brains evolved some ad­di­tional meta-learn­ing ca­pa­bil­ities! And I agree that your idea of hav­ing some kind of fire­wall be­tween in­tro­spec­tion and ob­ject-level re­al­ity mod­els could help pre­vent prob­lems. I’ve spent a lit­tle while think­ing con­cretely about how this could work.

(Hop­ing to run a differ­ent com­pe­ti­tion re­lated to these is­sues at some time in the fu­ture. Was think­ing it would be big­ger and longer—please PM me if you want to help con­tribute to the prize pool.)

• I am us­ing the term “self-aware” to mean “know­ing that one ex­ists in the world and can af­fect the world”, in which case an­i­mals, RL robots, etc., are all triv­ially self-aware. You seem to be us­ing the term “in­tro­spec­tive” for some­thing be­yond mere self-aware­ness—maybe “hav­ing con­cepts in the world-model that are suffi­ciently gen­eral that they ap­ply to both the out­side world and one’s in­ter­nal in­for­ma­tion-pro­cess­ing”. Some­thing like that? You can tell me.

So let’s take these two lev­els, self-aware­ness (“I ex­ist and can af­fect the world”) and in­tro­spec­tion (“Why am I think­ing about that? I seem to have an as­so­ci­a­tive mem­ory!”)

As I read the OP, it seems to me that self-aware­ness is the rele­vant thresh­old you rely on, not in­tro­spec­tion. (Do you agree?) I do think that self-aware­ness is what you need for pow­er­ful safety guaran­tees, and that we should study the pos­si­bil­ity of self-un­aware sys­tems, even if it’s not guaran­teed to be pos­si­ble.

As for introspection, I do in fact think that any AI system which can develop deep, general, mechanistic understandings of things in the world, and which is self-aware at all, will go beyond mere self-awareness to develop deep introspection. My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between their own thoughts and actions and those of others. Doing that just doesn't seem fundamentally different from, say, discovering the law of gravitation by discovering an analogy between the behavior of planets versus apples (which in turn is harder but not fundamentally different from knowing how to twist off a bottle cap by discovering an analogy with previous bottle caps that one has used). Thus, I think that the only way to prevent an arbitrarily intelligent world-modeling AI from developing arbitrarily deep introspective understanding is to build the system to have no self-awareness in the first place.

• Studying the possibility of self-unaware systems seems like a good idea, but I have a feeling most ways to achieve this will be brittle. My objective with this post was to get crisp stories for why self-aware predictive systems should be considered dangerous.

My rea­son is that such AIs will have a gen­eral ca­pa­bil­ity to find un­der­ly­ing pat­terns, and thus will dis­cover an anal­ogy be­tween its own thoughts and ac­tions and those of oth­ers.

Let's taboo introspection for a minute. Suppose the AI does discover some underlying patterns and analogizes the piece of matter in which it is encased with the thoughts and actions of its human operator. Not only that, it finds analogies between other computers and its human operator, between its human operator and other computers, etc. Why precisely is this a problem?

• I wouldn’t ar­gue that self-aware sys­tems are au­to­mat­i­cally dan­ger­ous, but rather that self-un­aware sys­tems are au­to­mat­i­cally safe (or at least com­par­a­tively pretty safe).

More speci­fi­cally: Most peo­ple in AI safety, most of the time, are talk­ing about self-aware (in my min­i­mal sense of tak­ing pur­pose­ful ac­tions etc.) agent-like sys­tems. I don’t think such sys­tems are au­to­mat­i­cally dan­ger­ous, but they do ne­ces­si­tate solv­ing the al­ign­ment prob­lem, and since we haven’t solved the al­ign­ment prob­lem yet, I think it’s worth spend­ing time ex­plor­ing al­ter­na­tive ap­proaches.

If you’re mak­ing a pre­dic­tion sys­tem (or an or­a­cle more gen­er­ally), there seems to be a pos­si­bil­ity of mak­ing it self-un­aware—it doesn’t know that it’s out­putting pre­dic­tions, it doesn’t know that it even has an out­put, it doesn’t know that it ex­ists in the uni­verse, etc. A toy ex­am­ple is a su­per­hu­man world-model which is com­pletely and eas­ily in­ter­pretable; you can just look at the data struc­ture and un­der­stand ev­ery as­pect of it, see what the con­cepts are and how they’re con­nected, and you can use that to ex­plore coun­ter­fac­tu­als and un­der­stand things etc. That data struc­ture is the whole sys­tem, and the hu­man users browse it. Any­way, I think the scariest safety risk for or­a­cles is that they’ll give ma­nipu­la­tive an­swers, use side-chan­nel at­tacks, or more gen­er­ally make in­tel­li­gent de­ci­sions to steer the fu­ture to­wards goals. A self-un­aware sys­tem will not do that be­cause it is not aware that it can do things to af­fect the uni­verse. There’s still some safety prob­lems (not to men­tion bad ac­tors etc.), but sig­nifi­cantly less scary ones.

• I wouldn’t ar­gue that self-aware sys­tems are au­to­mat­i­cally dan­ger­ous, but rather that self-un­aware sys­tems are au­to­mat­i­cally safe (or at least com­par­a­tively pretty safe).

Fair enough.

Most peo­ple in AI safety, most of the time, are talk­ing about self-aware (in my min­i­mal sense of tak­ing pur­pose­ful ac­tions etc.) agent-like sys­tems. I don’t think such sys­tems are au­to­mat­i­cally dan­ger­ous, but they do ne­ces­si­tate solv­ing the al­ign­ment prob­lem, and since we haven’t solved the al­ign­ment prob­lem yet, I think it’s worth spend­ing time ex­plor­ing al­ter­na­tive ap­proaches.

I sus­pect the im­por­tant part is the agent-like part.

I'm not sure it makes sense to think of "the alignment problem" as a singular entity. I'd rather taboo "the alignment problem" and just ask what could go wrong with a self-aware system that's not agent-like.

A self-un­aware sys­tem will not do that be­cause it is not aware that it can do things to af­fect the uni­verse.

Hot take: it might be use­ful to think of “self-aware­ness” and “aware­ness that it can do things to af­fect the uni­verse” sep­a­rately. Not sure they are one and the same.

• In other words, if the Pre­dict-O-Matic knows it will pre­dict P = A, it as­signs prob­a­bil­ity 1 to the propo­si­tion that it will pre­dict P = A.

It's a predictor—it produces probabilities (or expected value?). There are also some rules about probability that it might follow—like if asked to guess the probability it rains next Wednesday, it will give the same answer as if asked to guess the probability it will give when asked tomorrow.

• One pos­si­bil­ity is that it’s able to find a use­ful out­side view model such as “the Pre­dict-O-Matic has a his­tory of mak­ing nega­tive self-fulfilling prophe­cies”. This could lead to the Pre­dict-O-Matic mak­ing a nega­tive prophecy (“the Pre­dict-O-Matic will con­tinue to make nega­tive prophe­cies which re­sult in ter­rible out­comes”), but this prophecy wouldn’t be se­lected for be­ing self-fulfilling. And we might use­fully ask the Pre­dict-O-Matic whether the ter­rible self-fulfilling prophe­cies will con­tinue con­di­tional on us tak­ing Ac­tion A.

Maybe I misunderstood what you mean by dualism, but I don't think that's true. Say the Predict-O-Matic has an outside view model (of itself) like "The metal box on your desk (the Predict-O-Matic) will make a self-fulfilling prophecy that maximizes the number of paperclips". Then you ask it how likely it is that your digital records will survive for 100 years. It notices that that depends significantly on how much effort you make to secure them. It notices that that significantly depends on what the metal box on your desk tells you. It uses its low-resolution model of what the box says. To work that out, it checks which outputs would be self-fulfilling, and then which of these leads to the most paperclips. The more insecure your digital records are, the more you will invest in paper, and the more paperclips you will need. Therefore the metal box will tell you the lowest self-fulfilling probability for your question. Since that number is *self-fulfilling*, it is in fact the correct answer, and the Predict-O-Matic will answer with it.

I think this avoids your ar­gu­ment that

I contend that the Predict-O-Matic doesn't know it will predict P = A at the relevant time. That would require time travel: to know whether it will predict P = A, it would have to have made a prediction already, but it's still formulating its prediction as it thinks about what it will predict.

be­cause it doesn’t have to simu­late it­self in de­tail to know what the metal box (it) will do. The low-re­s­olu­tion model pro­vides a short­cut around that, but it will be ac­cu­rate de­spite the low re­s­olu­tion, be­cause by be­liev­ing it is sim­ple, it be­comes sim­ple.

Can you usefully ask for conditionals? Maybe. The answer to the conditional depends on what worlds you are likely to take Action A in. It might be that in most worlds where you do A, you do it because of a prediction from the metal box, and since we know those maximize paperclips, there's a good chance the action will fail to prevent it in those circumstances. But if that's not the case, for example because it's certain you won't ask the box any more questions between this one and the event it tries to predict, then the conditional answer might actually be useful.

It might be possible to avoid any problems of this sort by only ever asking questions of the type "Will X happen if I do Y now (with no time to receive new info between hearing the prediction and doing the action)?", because by backwards induction the correct answer will not depend on what you actually do. This doesn't avoid the scenarios in the original post where multiple people act on their Predict-O-Matics, but I suspect these aren't solvable without coordination.