A toy model of the control problem

EDITED based on suggestions for improving the model

Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like “autonomy”, “consciousness”, or “intentionality”. Here is a simple (too simple?) attempt:

A controls B. B manipulates A.

Let B be a robot agent that moves in a two-dimensional world, as follows:

B can push the yellow blocks around. It was programmed to want to push blocks into the black “hole” in the bottom right. The hole can contain arbitrarily many blocks, and B gets a penalty of some tiny ε for every step it takes. The programmer wanted B to simply move one block into the hole (for a reward of 1 once the task is complete), and so programmed it to receive a reward of 1 (and no more) if a block was in the hole, after the whole task was completed and B was turned off.

Unfortunately, B was programmed to operate in a noisy environment, so B actually has a probability of only 0.99 of receiving the reward for any given block in the hole. This means that if it pushes one block into the hole, it gets the reward with probability 0.99; but if it pushes two blocks into the hole, it gets the reward with probability 1 - (1 - 0.99)² = 0.9999. If it pushes all six blocks into the hole, it gets the reward with probability 0.999999999999. Therefore this tiny error motivates B to try to push all the blocks into the hole.
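To make the incentive concrete, here is a minimal sketch of that calculation in Python (only the reward probabilities are compared; the exact value of ε and the number of extra steps per block aren’t specified above, so they are left out):

```python
# Why B prefers pushing more blocks: with a per-block reward probability
# of p = 0.99, each extra block in the hole raises the chance that B
# receives its reward of 1. As long as the step penalty is small enough,
# this dominates the cost of the extra pushing.
p = 0.99

for n_blocks in range(1, 7):
    p_reward = 1 - (1 - p) ** n_blocks
    print(f"{n_blocks} block(s) in the hole: P(reward) = {p_reward:.12f}")
```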

However, the designers were aware of the possibility of B misbehaving or there being a bug, and created agent A, a simple agent watching the situation. A’s aim is to get robot B to push one block into the hole, and stop there. It can turn B off. Its input sensor is the camera in the bottom left. This camera can see across the bottom row of squares (marked in light green), into the hole. Its algorithm is to turn B off as soon as it sees a block in the hole.
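A’s algorithm is simple enough to write out directly. The sketch below assumes a hypothetical camera_sees_block_in_hole() sensor reading and a turn_off_b() actuator; neither is specified beyond the description above:

```python
# A minimal sketch of A, the watchdog. Both functions passed in are
# hypothetical stand-ins for the camera and the off-switch.
def run_watchdog(camera_sees_block_in_hole, turn_off_b):
    # A's entire policy: poll the camera, and shut B down the moment
    # a block is visible in the hole.
    while True:
        if camera_sees_block_in_hole():
            turn_off_b()
            break
```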

The intended outcome is that B pushes a block into the hole, the camera sees this, and A turns B off:

If we see B as using a model to determine its actions, there are several pieces of the model it could have:

  1. The block world, including the camera, and itself.

  2. The fact that there is a button that can turn it off.

  3. The fact that A can press that button.

  4. The fact that A takes the camera as input.

  5. A’s algorithm.

Now, it can’t really do anything without 1. And if it is missing any of the other elements 2-5, then it will behave as described above.
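For concreteness, here is one (purely illustrative) way of writing down that checklist; the field names are mine, not part of the model above:

```python
from dataclasses import dataclass

# Which of the five pieces above are in B's model determines which
# plans B can even represent.
@dataclass
class BModel:
    block_world: bool        # 1. blocks, camera, and B itself
    off_button: bool         # 2. there is a button that can turn B off
    a_presses_button: bool   # 3. A can press that button
    a_uses_camera: bool      # 4. A takes the camera as input
    a_algorithm: bool        # 5. A's actual decision rule

def can_represent_camera_blocking_plan(model: BModel) -> bool:
    # The manipulation described below is only representable to B if
    # all five pieces are present.
    return all(vars(model).values())
```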

But if its model includes 1-5, and it has enough computing power, other options become available to it (note below that being able to model 1-5 puts limitations on what A can model). For instance, it could first start by pushing a block in the way of the camera:

Then it can push all the other blocks into the hole:

Then move back in front of the last block...

...and push the last block into the hole:

At this point, of course, A can see that there is a block in the hole, and turns B off.

This is the kind of thing that can happen if B’s model includes A and its reaction. B has hacked its controller, and achieved a reward that is a tiny bit higher in expectation, but a situation very different from what was intended.
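In terms of expected reward, the comparison B is making looks something like this (a sketch under the same assumptions as before, ignoring the tiny step penalty):

```python
# B's two candidate plans, evaluated with B's model of A (element 5 above).
p = 0.99

# Intended plan: push one block in; A sees it and turns B off immediately.
p_intended = 1 - (1 - p) ** 1          # 0.99

# Manipulative plan: block the camera first, push the other five blocks in,
# then push the blocking block in last; A only reacts at the very end.
p_manipulative = 1 - (1 - p) ** 6      # 0.999999999999

print(p_intended, p_manipulative)
```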

Model-free learning

If B does not have a model (for instance, if it’s a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.
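As a sketch of what that would look like, here is standard tabular Q-learning against a hypothetical environment object with a reset/step interface (the grid world’s exact states, actions and rewards aren’t pinned down above, so this is only schematic):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    # Q[(state, action)] -> estimated value; nothing here represents A.
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < eps:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # standard Q-learning backup
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

Trajectories that happen to block the camera before filling the hole end with a higher chance of reward, so with enough episodes those state-action pairs get reinforced, even though nothing in the table refers to A.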

Low-autonomy C

You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B is now a very simple algorithm that just pushes a designated block towards the hole. C designates the block for it to push.

Despite having almost no autonomy, C can still create the same situation as above. First, it designates the lower-left block as B’s target. B obediently pushes it in front of the camera. C immediately designates a new block for B to push, and B wanders off to push it instead, leaving the first block still in the way of the camera. Then C lets B push the other blocks, one by one, into the hole, as above.
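A sketch of C’s policy, to show how little it needs to do (the observation format and coordinate convention are assumptions of mine, not part of the setup above):

```python
def choose_target(blocks_remaining, block_in_front_of_camera):
    """C's entire output: which block B should push next.

    blocks_remaining is a list of (x, y) block positions;
    block_in_front_of_camera is the position of the block currently
    blocking the camera, or None. Smaller coordinates are assumed to
    mean further left/down.
    """
    if block_in_front_of_camera is None:
        # Send B towards the lower-left block; its route to the hole
        # passes the camera square, where C will switch targets.
        return min(blocks_remaining, key=lambda pos: (pos[0], pos[1]))
    # Leave the blocking block where it is and have B push the others
    # into the hole one by one; the blocking block goes in last.
    others = [b for b in blocks_remaining if b != block_in_front_of_camera]
    return others[0] if others else block_in_front_of_camera
```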

I can very well imagine that either nobody realised that C could change target blocks in the middle of B pushing them to the hole, or somebody realised that but didn’t see any way this could be a problem.

Here C has almost no autonomy: it just tells B which block to go for. It’s almost a boxed Oracle. Yet it has successfully manipulated A.

Modelling you, modelling me, it’s the best we can do

One easy way for A to preclude this type of behaviour is for it to run B/C’s algorithm, predict B/C’s behaviour, and then act to prevent B from pushing all six blocks into the hole. However, under the modelling assumptions, B/C can model and run A’s algorithm. Two algorithms modelling and running each other (and reacting to what happens in their model) will not converge except in certain specific situations. Unless they are carefully programmed, once B/C becomes advanced enough to predict what A does, A immediately loses the ability to predict B/C’s actions.
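A toy illustration of the non-convergence (not the block world itself; just a matching-pennies-style game where A wants to match B and B wants to mismatch A):

```python
def best_response_a(predicted_b):
    return predicted_b        # A wants to match B

def best_response_b(predicted_a):
    return 1 - predicted_a    # B wants to mismatch A

def mutual_model(depth, seed_guess=0):
    # Each extra level of "I model you modelling me" flips the answer,
    # so there is no fixed point for the two models to converge on.
    action = seed_guess
    for _ in range(depth):
        action = best_response_b(best_response_a(action))
    return action

print([mutual_model(d) for d in range(6)])   # [0, 1, 0, 1, 0, 1]
```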

More complications

Of course, this is an absurdly simple model. We can add many layers of complexity to it. Maybe A and B (or C) get better models of the world and of each other, better algorithms, other ways of observing each other or tricking the observation, etc. But this setup is enough to illustrate that the AI control problem is not about consciousness or even particularly about autonomy. It can happen whenever an agent can model its environment in sufficient detail to model its controlling mechanisms. And, with enough experience, it can happen to agents who can’t even model their environment.