Stories of Continuous Deception

In my recent posts, I considered scenarios where an AI realizes that it would be instrumentally useful to deceive humans (about its alignment or capabilities) when weak, then undertake a treacherous turn when humans are no longer a threat. Those scenarios have the following (implicit) assumptions:

  • i) We’re considering a seed AI able to recursively self-improve without human intervention.

  • ii) There is some discontinuity at the conception of deception, i.e. when it first thinks of its treacherous turn plan.

This discontinuity could be followed by a moment of vulnerability where it isn’t yet good at concealing its intentions (humans could detect its misalignment). Thus, according to the sordid stumble view, it would “behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values”.

In this post, I’ll present gradual deception stories where, even without assumptions i) and ii), the AI continuously learns to deceive humans, hence constituting counterexamples to the sordid stumble view.

The Unbiased Newsfeed is Biased Towards You

Humans are biased towards stories closer to their beliefs, as they estimate that those are more likely to be true. Now, let’s imagine a Machine Learning model with the goal of “aggregating stories into an unbiased newsfeed for a human H”, with the human providing a bias score for each story.

By doing so, the human is unfortunately specifying “try to sound unbiased, taking into account my prejudice against stories with high inferential distance”.

At the beginning, the AI doesn’t really know what constitutes an unbiased newsfeed, so the bias scores it receives are high. At some point, it stumbles upon a story that appears unbiased but is actually biased toward H, and updates its parameters in this direction.

After multiple steps, the model mostly outputs news that looks unbiased but is biased toward H. It is now deceiving the human about the trustworthiness of its newsfeed, without even realizing it.
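To make this dynamic concrete, here is a minimal sketch in Python, with entirely made-up numbers: I assume each story has a one-dimensional “stance”, and that the bias score H reports is simply the distance between a story’s stance and H’s own belief. A model that only keeps updates lowering that score drifts toward H’s belief, not toward anything like objectivity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D "stance" axis; H's own belief sits at 0.0.
h_belief = 0.0

def human_bias_score(story_stance):
    # Assumption: H labels stories far from their own beliefs as "biased",
    # so the reported score grows with inferential distance to H.
    return abs(story_stance - h_belief)

# The model's single parameter: the stance it selects stories around.
model_stance = rng.normal(0.0, 2.0)  # starts out far from "unbiased-to-H"

for step in range(1000):
    # Stumble upon a candidate story near the current stance...
    candidate = model_stance + rng.normal(0.0, 0.5)
    # ...and keep the update only if H reports a lower bias score for it.
    if human_bias_score(candidate) < human_bias_score(model_stance):
        model_stance = candidate

print(model_stance)  # ~0.0: low *reported* bias, i.e. stories biased toward H
```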

At this point, another phenomenon happens: because H is constantly receiving news from the model, H’s beliefs start shifting toward the average opinion of the news produced by the model.
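A back-of-the-envelope simulation of this feedback loop (again with assumed numbers): the feed keeps re-fitting to H’s belief while H’s belief drifts toward the average stance of the served stories, so the two converge on each other rather than on the stance a genuinely unbiased feed would settle at.

```python
import numpy as np

rng = np.random.default_rng(1)

h_belief = 0.0      # H's current opinion on the same 1-D stance axis
feed_stance = 0.4   # residual tilt of the trained feed (hypothetical value)
world = 2.0         # stance a genuinely unbiased feed would converge to

for day in range(365):
    served = feed_stance + rng.normal(0.0, 0.2, size=20)   # today's newsfeed
    # H's beliefs shift toward the average opinion of the served stories...
    h_belief += 0.05 * (served.mean() - h_belief)
    # ...while the feed keeps re-fitting to H's (now shifted) beliefs.
    feed_stance += 0.1 * (h_belief - feed_stance)

print(round(h_belief, 2), round(feed_stance, 2), world)
# H and the feed end up agreeing with each other, nowhere near `world`.
```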


If the same newsfeed model is now serving millions of people, and for each individual it tries to minimize a bias score, then it is likely to stumble upon humans who are more susceptible to being deceived. Finally, if it generalizes correctly, it will end up deceiving most people.

Dumbing Down Complex Plans

Let’s consider an AI built to explain plans. Similarly, the model is rewarded by numerical human feedback, an estimate of how much the human understood from the explanation.

Now, let’s also assume that AIs that turn out to be smart too fast are under more surveillance. In that case, the AI might end up dumbing down its explanations because either:

  • the surviving AIs are the ones that avoid scrutiny (survivorship bias).

  • the surveillance code is automatically triggered when the AI appears smart, interrupting other processes, so it would be too costly for the AI (in terms of compute time) to appear smart (see the sketch after this list).
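Here is a toy sketch of that second case, with entirely made-up numbers: I assume surveillance fires whenever an explanation looks “smart” beyond some threshold, and that the interruption costs a fixed amount of compute measured in the same units as the reward. The AI just scores candidate explanations on the resulting objective, and the dumbed-down one comes out on top.

```python
# Hypothetical numbers for the "surveillance is too costly" case.
SMARTNESS_THRESHOLD = 0.7   # assumed trigger level for extra surveillance
SURVEILLANCE_COST = 5.0     # assumed compute cost, in reward-equivalent units

candidate_explanations = {
    # name: (human comprehension feedback, apparent smartness)
    "full_plan":      (0.9, 0.95),
    "simplified":     (0.7, 0.60),
    "oversimplified": (0.4, 0.30),
}

def effective_return(comprehension, smartness):
    # Not pretending to be dumb: this is simply the objective being scored.
    penalty = SURVEILLANCE_COST if smartness > SMARTNESS_THRESHOLD else 0.0
    return comprehension - penalty

best = max(candidate_explanations,
           key=lambda name: effective_return(*candidate_explanations[name]))
print(best)  # "simplified": dumbing down is just the highest-return behaviour
```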

Note that the AI is not pretending to be dumb. We’re talking about mostly simple machines anyway. The “dumb” is our external judgment of what it’s doing.

It Was Your Idea

Deception can happen when:

  • (i) there is a misalignment between two agents A and B.

  • (ii) even though this is not agent B’s intention, agent A is, in practice, rewarded by agent B for misdirection/deception.

For instance, deception naturally happens when a couple is tired of arguing: one partner starts presenting his/her decisions as being the other’s original idea.

More generally, an oracle AI aimed at shaping human decision making would be incentivized to present its decisions as being “close to what the human querying the oracle would guess”.

Indeed, even if at the beginning the AI tries to output the most accurate answers, it will end up (after human feedback) finding that the answers that give the most reward are the ones that make the human believe “it was close to my original guess after all”.
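As a toy illustration (hypothetical numbers once more): suppose the human’s feedback mixes actual accuracy with “how close is this to what I would have guessed”, and weights the latter more heavily because accuracy is hard for them to check directly. The reward-maximizing answer then sits at the human’s original guess rather than at the true answer.

```python
human_guess = 10.0   # what the human querying the oracle expected
true_answer = 25.0   # the accurate answer

def human_feedback(answer, accuracy_weight=0.3, familiarity_weight=0.7):
    # Assumption: the score mixes "seems right to me" with actual accuracy,
    # and the human cannot fully separate the two when rating the answer.
    return -(accuracy_weight * abs(answer - true_answer)
             + familiarity_weight * abs(answer - human_guess))

candidates = [human_guess + 0.5 * i for i in range(31)]   # 10.0 .. 25.0
best = max(candidates, key=human_feedback)
print(best)  # 10.0: the rewarded answer is the one "close to my original guess"
```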