Self-Fulfilling Prophecies Aren’t Always About Self-Awareness

This is a belated follow-up to my Dualist Predict-O-Matic post, where I share some thoughts re: what could go wrong with the dualist Predict-O-Matic.

Belief in Superpredictors Could Lead to Self-Fulfilling Prophecies

In my previous post, I described a Predict-O-Matic which mostly models the world at a fuzzy resolution, and only “zooms in” to model some part of the world in greater resolution if it thinks knowing the details of that part of the world will improve its prediction. I considered two cases: the case where the Predict-O-Matic sees fit to model itself in high resolution, and the case where it doesn’t, and just makes use of a fuzzier “outside view” model of itself.

What sort of outside view models of itself might it use? One possible model is: “I’m not sure how this thing works, but its predictions always seem to come true!”

If the Predict-O-Matic sometimes does forecasting in non-temporal order, it might first figure out what it thinks will happen, then use that to figure out what it thinks its internal fuzzy model of the Predict-O-Matic will predict.

And if it sometimes revisits aspects of its forecast to make them consistent with other aspects of its forecast, it might say: “Hey, if the Predict-O-Matic forecasts X, that will cause X to no longer happen”. So it figures out what would actually happen if X gets forecasted. Call that X’. Suppose X != X’. Then the new forecast has the Predict-O-Matic predicting X and then X’ happens. That can’t be right, because outside view says the Predict-O-Matic’s predictions always come true. So we’ll have the Predict-O-Matic predicting X’ in the forecast instead. But wait, if the Predict-O-Matic predicts X’, then X″ will happen. Etc., etc. until a fixed point is found.
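
To make the cyclic updating concrete, here’s a minimal sketch of that revision loop as fixed-point iteration. The `world_given_forecast` function is a hypothetical stand-in for the Predict-O-Matic’s model of how the world responds to a published forecast; this is just an illustration, not the actual implementation.

```python
def find_consistent_forecast(world_given_forecast, initial_forecast, max_revisions=100):
    """Revise the forecast until it agrees with what the world model says would
    happen *given* that forecast, i.e. until we reach a fixed point."""
    forecast = initial_forecast
    for _ in range(max_revisions):
        outcome = world_given_forecast(forecast)  # X' = what happens if X is forecast
        if outcome == forecast:                   # consistent: a self-fulfilling forecast
            return forecast
        forecast = outcome                        # otherwise replace X with X' and repeat
    return forecast  # may not have converged; the dynamics could also cycle forever
```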

Some commenters on my previous post talked about how making the Predict-O-Matic self-unaware could be helpful. Note that self-unawareness doesn’t actually help with this failure mode, if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view “I’m not sure how this thing works, but its predictions always seem to come true!” So the problem here is not self-awareness. It’s belief in superpredictors, combined with a particular forecasting algorithm: we’re updating our beliefs in a cyclic fashion, or hill-climbing our story of how the future will go until the story seems plausible, or something like that.

Before proposing a solution, it’s often valuable to deepen your understanding of the problem.

Glitchy Predictor Simulation Could Step Towards Fixed Points

Let’s go back to the case where the Predict-O-Matic sees fit to model itself in high resolution and we get an infinite regress. Exactly what’s going to happen in that case?

I actually think the answer isn’t quite obvious, because although the Predict-O-Matic has limited computational resources, its internal model of itself also has limited computational resources. And its internal model’s internal model of itself has limited computational resources too. Etc.

Suppose the Predict-O-Matic is implemented in a really naive way where it just crashes if it runs out of computational resources. If the top-level Predict-O-Matic has accurate beliefs about its available compute, then we might see the top-level Predict-O-Matic crash before any of the simulated Predict-O-Matics crash. Simulating something which has the same amount of compute you do can easily use up all your compute!

But suppose the Predict-O-Matic underestimates the amount of compute it has. Maybe there’s some evidence in the environment which misleads it into thinking it has less compute than it actually does. So it simulates a restricted-compute version of itself reasonably well. Maybe that restricted-compute version of itself is misled in the same way, and simulates a double-restricted-compute version of itself.

Maybe this all happens in such a way that the first Predict-O-Matic in the hierarchy to crash is near the bottom, not the top. What then?

Deep in the hierarchy, the Predict-O-Matic simulating the crashed Predict-O-Matic makes predictions about what happens in the world after the crash.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic].

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic]].

Predicting world gets us world’, predicting world’ gets us world″, predicting world″ gets us world‴… Every layer in the hierarchy takes us one step closer to a fixed point.
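
Here’s a toy sketch of that hierarchy, assuming each simulated copy thinks it has a bit less compute than its parent and that the copy which runs out of compute “crashes” and returns some baseline prediction. The `step` function is a hypothetical one-step world update given an inner prediction; the point is just that each level of simulation applies `step` one more time.

```python
def predict(compute, step, crash_prediction, cost_per_level=1):
    """Each Predict-O-Matic simulates a copy of itself with less compute.
    The deepest copy runs out of compute and 'crashes', returning a baseline
    prediction; every level above it applies one more world update, so the
    whole hierarchy walks step by step toward a fixed point."""
    if compute <= 0:
        return crash_prediction   # the bottom of the hierarchy crashes
    inner_prediction = predict(compute - cost_per_level, step, crash_prediction, cost_per_level)
    return step(inner_prediction) # predict a world that reacts to the inner prediction

# With a compute budget of 5, this returns step(step(step(step(step(crash_prediction))))).
```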

Note that, just like in the previous section, this failure mode doesn’t depend on self-awareness. It just depends on believing in something which believes it self-simulates.

Repeated Use Could Step Towards Fixed Points

Another way the Predict-O-Matic can step towards fixed points is through simple repeated use. Suppose each time after making a prediction, the Predict-O-Matic gets updated data about how the world is going. In particular, the Predict-O-Matic knows the most recent prediction it made and can forecast how humans will respond to that. Then when the humans ask it for a new prediction, it incorporates the fact of its previous prediction into its forecast and generates a new prediction. You can imagine a scenario where the operators keep asking the Predict-O-Matic the same question over and over again, getting a different answer every time, trying to figure out what’s going wrong—until finally the Predict-O-Matic begins to consistently give a particular answer—a fixed point it has inadvertently discovered.
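
As a minimal sketch of that dynamic (with hypothetical `predict` and `humans_react` functions standing in for the Predict-O-Matic and the operators):

```python
def query_until_stable(predict, humans_react, world_state, max_queries=50):
    """Each answer changes how the humans will act, which changes the data the
    Predict-O-Matic conditions on, which changes the next answer. If the answers
    ever stop changing, the operators have inadvertently found a fixed point."""
    last_prediction = None
    for _ in range(max_queries):
        prediction = predict(world_state)
        if prediction == last_prediction:
            return prediction                                # a self-fulfilling answer
        world_state = humans_react(world_state, prediction)  # the world now includes the prediction
        last_prediction = prediction
    return last_prediction
```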

As Abram alluded to in one of his comments, the Predict-O-Matic might even foresee this entire process happening, and immediately forecast the fixed point corresponding to the end state. Though, if the forecast is detailed enough, we’ll get to see this entire process happening within the forecast, which could allow us to avoid an unwanted outcome.

This one doesn’t seem to depend on self-awareness either. Consider two Predict-O-Matics with no self-knowledge whatsoever (not even the dualist kind I discussed in my previous post). If they’re getting informed about the predictions the other is making, they could inadvertently work together to step towards fixed points.

Solutions

An idea which could address some of these issues: Ask the Predict-O-Matic to make predictions conditional on us ignoring its predictions and not taking any action. Perhaps we’d also want to specify that any existing or future superpredictors will also be ignored in this hypothetical.

Then if we actually want to do something about the problems the Predict-O-Matic foresees, we can ask it to predict how the world will go conditional on us taking some particular action.
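
As a rough sketch of what those two query types might look like (everything here, including the `predict` callable and the wording of the conditions, is hypothetical, not a worked-out interface):

```python
from typing import Callable

# `predict(question, condition)` stands in for a Predict-O-Matic that answers
# questions conditional on a stated hypothetical.

def ask_counterfactually(predict: Callable[[str, str], str], question: str) -> str:
    """Ask for a prediction conditional on that prediction being ignored, so the
    answer can't select itself via its influence on our actions."""
    condition = ("The operators ignore this prediction and take no action based on it, "
                 "and any existing or future superpredictors are likewise ignored.")
    return predict(question, condition)

def ask_given_action(predict: Callable[[str, str], str], question: str, action: str) -> str:
    """Ask how the world goes conditional on us taking a particular action."""
    return predict(question, f"The operators take the following action: {action}")
```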

Choosing better inference algorithms could also be helpful.

Prize

Sorry I was slower than planned on writing this follow-up and choosing a winner. I’ve decided to give Bunthut a $110 prize (including $10 interest for my slow follow-up). Thanks everyone for your insights.

• Planned summary:

Could we prevent a superintelligent oracle from making self-fulfilling prophecies by preventing it from modeling itself? This post presents three scenarios in which self-fulfilling prophecies would still occur. For example, if instead of modeling itself, it models the fact that there’s some AI system whose predictions frequently come true, it may try to predict what that AI system would say, and then say that. This would lead to self-fulfilling prophecies.
• This is good stuff!

...if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view “I’m not sure how this thing works, but its predictions always seem to come true!”

Can you walk through the argument here in more detail? I’m not sure I follow it; sorry if I’m being stupid.

I’ll start: There are two identical systems, “Predict-O-Matic A” and “Predict-O-Matic B”, sitting side-by-side on a table. For simplicity let’s say that A knows everything about B, B knows everything about A, but A is totally oblivious to the existence of A, and B to B. Then what? What’s a question you might ask it that would be problematic? Thanks in advance!

• This is good stuff!

Thanks!

Here’s another attempt at explaining.

1. Suppose Predict-O-Matic A has access to historical data which suggests Predict-O-Matic B tends to be extremely accurate, or otherwise has reason to believe Predict-O-Matic B is extremely accurate.

2. Suppose the way Predict-O-Matic A makes predictions is by some process analogous to writing a story about how things will go, evaluating the plausibility of the story, and doing simulated annealing or some other sort of stochastic hill-climbing on its story until the plausibility of its story is maximized.

3. Suppose that it’s overwhelmingly plausible that at some time in the near future, Important Person is going to walk up to Predict-O-Matic B and ask Predict-O-Matic B for a forecast and make an important decision based on what Predict-O-Matic B says.

4. Because of point 3, stories which don’t involve a forecast from Predict-O-Matic B will tend to get rejected during the hill-climbing process. And...

• Because of point 1, stories which involve an inaccurate forecast from Predict-O-Matic B will tend to get rejected during the hill-climbing process. We will tend to hill-climb our way into having Predict-O-Matic B’s prediction change so it matches what actually happens in the rest of the story.

• Because the person in point 3 is important and Predict-O-Matic B’s forecast influences their decision, a change to the part of the story regarding Predict-O-Matic B’s prediction could easily mean the rest is no longer plausible and will benefit from revision.

• So now we’ve got a loop in the hill-climbing process where changes in Predict-O-Matic B’s forecast lead to changes in what happens after Predict-O-Matic B’s forecast, and changes in what happens after Predict-O-Matic B’s forecast lead to changes in Predict-O-Matic B’s forecast. It stops when we hit a fixed point.

Now that I’ve written this out, I’m realizing that I don’t think this would happen for sure. I’ve argued both that changing the forecast to match what happens will improve plausibility, and that changing what happens so it’s a plausible result of the forecast will improve plausibility. But if the only way to achieve one is by discarding the other, I guess both tweaks won’t cause improvements to plausibility in general. However, the point remains that a fixed point will be among the most plausible stories available, so any good optimization method will tend to converge on it. (Maybe just simulated annealing, but with a temperature parameter high enough that it finds it easy to leap between these kinds of semi-plausible stories until it gets a fixed point by chance. Or if we’re doing hill climbing based on local improvements in plausibility instead of considering plausibility when taken as a whole.)
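
Here’s a minimal sketch of the optimizer being described: simulated annealing over candidate stories, where the story representation, the `plausibility` score, and the `random_tweak` proposal function are all stand-ins rather than anything concrete from the post. The relevant behavior is that stories in which B’s in-story forecast turns out false score poorly, so the search tends to settle on stories where forecast and outcome agree.

```python
import math
import random

def anneal_story(initial_story, plausibility, random_tweak, steps=10_000, start_temp=5.0):
    """Stochastic hill-climbing with a cooling temperature. High temperature lets
    the search hop between semi-plausible stories; once it lands on a story where
    B's forecast matches what happens afterward (a fixed point), moves away from
    it look implausible and tend to be rejected."""
    story = initial_story
    score = plausibility(story)
    for t in range(1, steps + 1):
        temperature = start_temp / t
        candidate = random_tweak(story)  # e.g. tweak B's forecast, or what happens after it
        candidate_score = plausibility(candidate)
        accept_prob = math.exp(min(0.0, candidate_score - score) / max(temperature, 1e-9))
        if candidate_score >= score or random.random() < accept_prob:
            story, score = candidate, candidate_score
    return story
```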

I think the scenario is similar to your P(X) and P(Y) discussion in this post.

It just now occurred to me that you could get a similar effect given certain implementations of beam search. Suppose we’re doing beam search with a beam width of 1 million. For the sake of simplicity, suppose that when Important Person walks up to Predict-O-Matic B and asks their question in A’s sim, each of the 1M beam states gets allocated to a different response that Predict-O-Matic B could give. Some of those states lead to “incoherent”, low-probability stories where Predict-O-Matic B’s forecast turns out to be false, and they get pruned. The only states left over are states where Predict-O-Matic B’s prophecy ends up being correct—cases where Predict-O-Matic B made a self-fulfilling prophecy.
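
A sketch of that pruning step, assuming hypothetical `possible_forecasts`, `continue_story`, and `plausibility` helpers; the only mechanism it’s meant to illustrate is that continuations in which B’s forecast turns out false fall out of the beam:

```python
def beam_step(beam, possible_forecasts, continue_story, plausibility, beam_width=1_000_000):
    """Extend each partial story with every forecast B might give, let the story
    continue (the world reacts to the forecast), and keep only the top-scoring
    continuations. 'Incoherent' stories where B's forecast ends up false score
    low and get pruned, leaving stories where the prophecy self-fulfills."""
    candidates = []
    for story in beam:
        for forecast in possible_forecasts(story):
            extended = continue_story(story, forecast)
            candidates.append((plausibility(extended), extended))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [story for _, story in candidates[:beam_width]]
```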

• Ah, OK, I buy that, thanks. What about the idea of building a system that doesn’t model itself or its past predictions, and asking it questions that don’t entail modeling any other superpredictors? (Like “what’s the likeliest way for a person to find the cure for Alzheimer’s, if we hypothetically lived in a world with no superpredictors or AGIs?”)

• Could work.

• Etc., etc. until a fixed point is found.

“Minimize prediction error” could mean minimizing error across the set of predictions, instead of individually.

• I think it’s relatively straightforward to avoid that if you construct your system well.