A Treacherous Turn Timeline—Children, Seed AIs and Predicting AI

I pre­vi­ously wrote about a toy model where a seed AI starts by ex­hibit­ing hu­man-de­sir­able be­hav­iors, be­fore un­der­tak­ing a treach­er­ous turn when hu­mans are no longer a threat. In this post, I’ll link this phe­nomenon to how chil­dren learn to lie and self-con­firm­ing pre­dic­tions.


To mas­ter ly­ing, a child should:

1. Possess the nec­es­sary cog­ni­tive abil­ities to lie (for in­stance, by be­ing able to say words or sen­tences).
2. Un­der­stand that hu­mans can (de­liber­ately) say false­hoods about the world or their be­liefs.
3. Prac­tice ly­ing, al­low­ing him­self/​her­self to be pun­ished if caught.

If lan­guage ac­qui­si­tion flour­ishes when chil­dren are aged 15-18 months, the pro­por­tion of them who lie (about peek­ing in a psy­chol­ogy study) goes from 30% at age 2, to 50% of three-year olds, even­tu­ally reach­ing 80% at eight. Most im­por­tantly, they get bet­ter as they get older, go­ing from blatant lies to pre­tend­ing to be mak­ing rea­son­able/​hon­est guesses.

There is there­fore a gap be­tween the mo­ment chil­dren could (in the­ory) lie (18 months) and the mo­ment they can effec­tively lie and use this tech­nique to their own ad­van­tage (8 years old). Dur­ing this gap, par­ents can cor­rect the kid’s moral val­ues through ed­u­ca­tion.

Seed AI

In the seed AI case, de­ceiv­ing could mean ei­ther:

a. Not us­ing its full ca­pa­bil­ities. This could mean not go­ing “all in” to perform well on a task, or en­sur­ing that hu­mans are not able to mea­sure its “in­tel­li­gence” with their cur­rent sen­sors.
b. Con­ceal­ing its true (un­al­igned) in­ten­tions by not car­ry­ing through un­al­igned plans when both (i) {a su­per­vi­sor is watch­ing (cf. ab­sent su­per­vi­sor)} and (ii) {the su­per­vi­sor won’t be dead at the end of the plan} are true.

Be­fore un­der­tak­ing a treach­er­ous turn, a seed AI would need to go through the fol­low­ing steps:

1. No­tice it is be­ing watched by a su­per­vi­sor.
2. Un­der­stand that it can con­trol whether the su­per­vi­sor is watch­ing (e.g. by ob­struct­ing the view of a cam­era or by kil­ling the su­per­vi­sor), and even what the su­per­vi­sor is watch­ing (for in­stance by obfus­cat­ing its code or by in­fluenc­ing what the su­per­vi­sor’s sen­sors are mea­sur­ing).
3. Figure out that it could coun­ter­fac­tu­ally op­ti­mize even more its ob­jec­tive func­tion if this su­per­vi­sor didn’t ex­ist (and there­fore didn’t have to satisfy its prefer­ence).
4. Plan to op­ti­mize its ob­jec­tive func­tion (in the long term) by fak­ing al­ign­ment and perform­ing sub­op­ti­mally for a while.
5. Un­der­take a treach­er­ous turn when hu­mans are no longer a threat (be­cause reached ca­pa­bil­ity thresh­old).

Let’s see how hav­ing for­mal­ized those steps can help us think about de­cep­tion.

Ac­cord­ing to the sor­did stum­ble view, “an AI that lacks hu­man-de­sir­able val­ues will be­have in a way that re­veals its hu­man-un­de­sir­able val­ues to hu­mans be­fore it gains the ca­pa­bil­ity to de­ceive hu­mans into be­liev­ing that it has hu­man-de­sir­able val­ues.”

In other words, an un­al­igned seed AI would start by be­ing re­ally bad at pre­tend­ing it has al­igned val­ues, so hu­mans could just cor­rect the AI’s code on the go (similarly to the “chil­dren learn­ing to lie” case).

Us­ing the above for­mal­ism, the sor­did stum­ble view states that when the seed AI will be think­ing about un­al­igned plans (steps 2. and 3.) with­out be­ing smart enough to do 4. or 5., it will be re­ally bad at ly­ing, thus vuln­er­a­ble.

The core of the dis­agree­ment be­tween Bostrom (treach­er­ous turn) and Go­ertzel (sor­did stum­ble) is about how long steps 2. and 3. will take, and how ob­vi­ous the seed AI’s un­al­ign­ment will look like dur­ing these steps.

Pre­dict­ing AI

A pre­dict­ing AI (spe­cial case of or­a­cle AI) could de­ceive its su­per­vi­sor, by:

a. De­liber­ately ma­nipu­lat­ing hu­mans to let it es­cape (for­mal­ized by an “es­cape” mes­sage/​ac­tion here).
b. Grad­u­ally out­putting self-con­firm­ing pre­dic­tions that are in­creas­ingly ma­nipu­la­tive (as de­scribed in the last para­graph from this post).

In the sec­ond case, the ar­gu­ment (from Stu­art Arm­strong’s post) goes like this:

1. After each ques­tion-an­swer iter­a­tion, the pre­dict­ing AI out­put pre­dic­tions that are more and more in­ac­cu­rate ac­cord­ing to how the world would look like if the pre­dict­ing AI wasn’t there. It grad­u­ally im­proves in a) pre­dict­ing how hu­mans will re­act to its an­swers (un­der­stand­ing the “hu­man part” of the en­vi­ron­ment) b) con­vinc­ing/​ma­nipu­lat­ing hu­mans c) pre­dict­ing things that are true no mat­ter what (be­cause also gets to un­der­stand the “non-hu­man” part of the en­vi­ron­ment).
2. The AI even­tu­ally con­verges to the fixed-point solu­tion of this “feed­back loop” in­ter­ac­tion, which is to out­put self-con­firm­ing pre­dic­tions. At this point, it is so ma­nipu­la­tive that its pre­dic­tions in­clude mes­sages such as “please build a su­per­in­tel­li­gence” or “let me es­cape”.


  • For the chil­dren and the seed AI, there is a gap be­tween the first mo­ment of vuln­er­a­bil­ity (when the agent is bad at de­ceiv­ing or doesn’t even know that de­ceiv­ing is a thing) and when the agent effec­tively con­ceals its true in­ten­tions/​ca­pa­bil­ities.

  • The main un­cer­tainty for the seed AI is how long it will stay un­al­igned with­out effec­tively con­ceal­ing its ca­pa­bil­ities and in­ten­tions (af­ter hav­ing planned a treach­er­ous turn).

  • For the pre­dict­ing AI, the ma­nipu­la­tion/​de­cep­tion oc­curs nat­u­rally and grad­u­ally be­cause of a pre­dic­tive feed­back loop, with­out ne­ces­sit­ing an ini­tial “con­cep­tion of de­cep­tion”.