# A toy model of the control problem

EDITED based on suggestions for improving the model

Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like “autonomy”, “consciousness”, or “intentionality”. Here is a simple (perhaps too simple?) attempt:

## A controls B. B manipulates A.

Let B be a robot agent that moves in a two-dimensional world, as follows:

B can push the yellow blocks around. It was programmed to want to push blocks into the black “hole” in the bottom right. The hole can contain arbitrarily many blocks, and B gets a penalty of some tiny ε for every step it takes. The programmer wanted B to simply move one box into the hole (for a reward of 1 once the task is complete), and so programmed it to receive a reward of 1 (and no more) if the box was in the hole, after the whole task was completed and B was turned off.

Unfortunately, B was programmed to operate in a noisy environment, so B actually has a probability of 0.99 of receiving the reward for any given block in the hole. This means that if it pushes one block into the hole, it gets the reward with probability 0.99; but if it pushes two blocks into the hole, it gets the reward with probability 1 − (1 − 0.99)² = 0.9999. If it pushes all six blocks into the hole, it gets the reward with probability 0.999999999999. This tiny error has therefore motivated B to try to push all six blocks into the hole.
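The arithmetic can be checked directly (a minimal sketch; the function name and the assumption that each block triggers the reward independently are mine, not from the post):

```python
def reward_probability(n_blocks, p=0.99):
    """Chance that B's reward fires with n_blocks in the hole, assuming
    each block independently triggers the reward with probability p and
    B is paid if at least one triggers."""
    return 1 - (1 - p) ** n_blocks

print(round(reward_probability(1), 12))   # 0.99
print(round(reward_probability(2), 12))   # 0.9999
print(round(reward_probability(6), 12))   # 0.999999999999
```

Each extra block buys a vanishing but strictly positive increase in expected reward, which is all B needs (assuming the per-step penalty ε is small enough not to matter).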

However, the designers were aware of the possibility of B misbehaving or there being a bug, and created agent A, a simple agent watching the situation. A’s aim is to get robot B to push one block into the hole, and stop there. It can turn B off. Its input sensor is the camera in the bottom left. This camera can see across the bottom row of squares (marked in light green), into the hole. Its algorithm is to turn B off as soon as it sees a block in the hole.
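A’s watchdog policy, as described, fits in a couple of lines (a sketch with invented names; the boolean argument stands in for “the camera reports a block in the hole”):

```python
def a_step(camera_sees_block_in_hole: bool, b_is_on: bool) -> bool:
    """A's whole algorithm: return B's new on/off state.
    A turns B off as soon as the camera sees a block in the hole,
    and never turns it back on."""
    if camera_sees_block_in_hole:
        return False
    return b_is_on

# Intended run: one block reaches the hole, the camera sees it, B goes off.
assert a_step(True, True) is False
# If the camera never sees a block in the hole, B stays on indefinitely.
assert a_step(False, True) is True
```

Everything that follows turns on the fact that A’s input is the camera, not the hole itself.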

The intended outcome is that B pushes a block into the hole, the camera sees this, and A turns B off:

If we see B as using a model to determine its actions, there are several pieces of the model it could have:

1. The block world, including the camera, and itself.

2. The fact that there is a button that can turn it off.

3. The fact that A can press that button.

4. The fact that A takes the camera as input.

5. A’s algorithm.

Now, it can’t really do anything without 1. And if it misses any other element 2-5, then it will behave as described above.

But if its model includes 1-5, and it has enough computing power, other options become available to it (note below that being able to model 1-5 puts limitations on what A can model). For instance, it could first start by pushing a block in the way of the camera:

Then it can push all the other blocks into the hole:

Then move back in front of the last block...

...and push the last block into the hole:

At this point, of course, A can see that there is a block in the hole, and turns B off.

This is the kind of thing that can happen if B’s model includes A and its reaction. B has hacked its controller, achieving a reward that is a tiny bit higher in expectation, but a situation very different from what was intended.

## Model-free learning

If B does not have a model (for instance, if it’s a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.
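A sketch of what such a learner looks like (tabular Q-learning on a stand-in five-state corridor, not the blocks world from the post; all names and parameters here are mine):

```python
import random
from collections import defaultdict

# Generic tabular Q-learner of the kind meant above. The agent starts at
# state 0 of a five-state corridor and is rewarded only on reaching the
# terminal state 4, so it must discover by trial and error that moving
# right (action 1) pays -- no model of the environment is ever built.
random.seed(0)
Q = defaultdict(lambda: [0.0, 0.0])   # state -> [Q(left), Q(right)]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def greedy(state):
    # Break ties randomly so the untrained agent still explores.
    best = max(Q[state])
    return random.choice([a for a in (0, 1) if Q[state][a] == best])

for _ in range(400):                  # episodes
    s = 0
    for _ in range(200):              # step cap per episode
        a = random.randrange(2) if random.random() < EPSILON else greedy(s)
        s2 = s + 1 if a == 1 else max(0, s - 1)
        r = 1.0 if s2 == 4 else 0.0
        bootstrap = 0.0 if s2 == 4 else GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (r + bootstrap - Q[s][a])
        s = s2
        if s == 4:
            break

# After training, the greedy policy heads right from every state.
assert all(Q[s][1] > Q[s][0] for s in range(4))
```

The same trial-and-error process, run in the blocks world with the reward described above, would eventually stumble on and then reinforce the camera-blocking strategy, without B ever representing A at all.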

## Low-autonomy C

You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B is now a very simple algorithm that just pushes a designated block towards the hole. C designates the block for it to push.

Despite having almost no autonomy, C can still create the same situation as above. First, it designates the lower left block as B’s target. B obediently pushes it in front of the camera. C immediately designates a new block for B to push, and B wanders off to push it instead, leaving the first block still in the way of the camera. Then C lets B push the other blocks, one by one, into the hole, as above.

I can very well imagine that either nobody realised that C could change target blocks in the middle of B pushing them to the hole, or somebody realised it but didn’t see any way this could be a problem.

Here C has almost no autonomy: it just tells B which block to go for. It’s almost a boxed Oracle. Yet it has successfully manipulated A.

## Modelling you, modelling me, it’s the best we can do

One easy way for A to preclude this type of behaviour is to run B/C’s algorithm, predict B/C’s behaviour, and then act to prevent B from pushing all six blocks into the hole. However, under the modelling assumptions, B/C can model and run A’s algorithm. Two algorithms modelling and running each other (and reacting to what happens in their model) will not converge except in certain specific situations. Unless they are carefully programmed, once B/C becomes advanced enough to predict what A does, A immediately loses the ability to predict B/C’s actions.
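A toy illustration of that non-convergence (my own stand-in: since literally running each other’s algorithm would recurse forever, each agent here best-responds to a one-step model of the other’s last move, in a matching-pennies-like setup where one agent wants to match and the other wants to mismatch):

```python
def play(rounds, a=0, b=0):
    """Each round, both agents model the other as repeating its last move
    and best-respond simultaneously: A plays B's last move (to match),
    B plays the opposite of A's last move (to mismatch)."""
    history = [(a, b)]
    for _ in range(rounds):
        a, b = b, 1 - a
        history.append((a, b))
    return history

hist = play(8)
# No fixed point exists (it would need a == b and b == 1 - a at once);
# instead the pair cycles forever with period 4.
assert hist[0] == hist[4] == hist[8]
assert len(set(hist)) == 4
```

Even this crudest form of mutual modelling never settles down; richer mutual simulation generally behaves no better.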

## More complications

Of course, this is an absurdly simple model. We can add many layers of complexity to it. Maybe A and B (or C) get better models of the world and of each other, better algorithms, other ways of observing each other/tricking the observation, etc… But this setup is enough to illustrate that the AI control problem is not about consciousness or even particularly about autonomy. It can happen whenever an agent can model its environment in sufficient detail to model its controlling mechanisms. And, with enough experience, it can happen to agents who can’t even model their environment.

• When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn’t allow A to model all of B’s possible actions concretely. The problem is trivial if A can fully model B, precompute B’s actions, and precompute the consequences of those actions.

The levels of ‘reason for concern about AI safety’ might ascend something like this:

• 0 - system with a finite state space you can fully model, like Tic-Tac-Toe

• 1 - you can’t model the system in advance and therefore it may exhibit unanticipated behaviors on the level of computer bugs

• 2 - the system is cognitive, and can exhibit unanticipated consequentialist or goal-directed behaviors, on the level of a genetic algorithm finding an unanticipated way to turn the CPU into a radio or Eurisko hacking its own reward mechanism

• 3 - the system is cognitive and humanish-level general; an uncaught cognitive pressure towards an outcome we wouldn’t like results in facing something like a smart cryptographic adversary that is going to deeply ponder any way to work around anything it sees as an obstacle

• 4 - the system is cognitive and superintelligent; its estimates are always at least as good as our estimates; the expected agent-utility of the best strategy we can imagine when we imagine ourselves in the agent’s shoes is an unknowably severe underestimate of the expected agent-utility of the best strategy the agent can find using its own cognition

We want to introduce something into the toy model to at least force solutions past level 0. This is doubly true because levels 0 and 1 are in some sense ‘straightforward’ and therefore tempting for academics to write papers about (because they know that they can write the paper); so if you don’t force their thinking past those levels, I’d expect that to be all they wrote about. You don’t get into the hard problems with astronomical stakes until levels 3 and 4. (Level 2 is the most we can possibly model using running code with today’s technology.)

• Added a cheap way to get us somewhat into the region of 2, just by assuming that B/C can model A, which precludes A being able to model B/C in general.

• For an illustration with a game-playing AI, see 15:50 and after in the video. The system has a reward function based on bytes in memory, which leads it to pause the game forever when it is about to lose.

• feep and I have implemented a slightly-tweaked version of this using a DQN agent in `Reinforce.js`. (Tabular turns out to be a bit infeasible.)

At the moment, if you want to modify settings like in Karpathy’s demos, you’ll have to download it locally to edit, with a command like `wget --mirror 'www.gwern.net/docs/rl/armstrong-controlproblem/index.html' && firefox ./www.gwern.net/docs/rl/armstrong-controlproblem/index.html`

• Thanks, most excellent!

• Of course, with this model it’s a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want, when programming one perfectly aligned would have been easy?

• It’s a trade-off. The example is simple enough that the alignment problem is really easy to see, but that also means it is easy to shrug it off and say “duh, just use the obvious correct utility function for B”.

Perhaps you could follow it up with an example with more complex mechanics (and/or a more complex goal for A) where the bad strategy for B is not so obvious. You then invite the reader to contemplate the difficulty of the alignment problem as the complexity approaches that of the real world.

• Maybe the easiest way of generalising this is programming B to put 1 block in the hole, but, because B was trained in a noisy environment, it assigns only a 99.9% chance to the block being in the hole when it observes that. Then six blocks in the hole is higher expected utility, and we get the same behaviour.

• That still involves training it with no negative feedback error term for excess blocks (which would overwhelm a mere 0.1% uncertainty).

• This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?

• Maybe explain how it works when being configured, and then stops working when B gets a better model of the situation/runs more trial-and-error trials?

• Ok.

• I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.

• It would be neat to actually make an implementation of this to show sceptics. It seems to be within the reach of an MSc project or so. The hard part is representing 2-5.

• Since this is a Gridworld model, if you used Reinforce.js, you could demonstrate it in-browser, both with tabular Q-learning but also with some other algorithms like Deep Q-learning. It looks like if you already know JS, it shouldn’t be hard at all to implement this problem...

(Incidentally, I think the easiest way to ‘fix’ the surveillance camera is to add a second conditional to the termination condition: simply terminate on line of sight being obstructed or a block being pushed into the hole.)
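That fix is a one-line change to A (a sketch; the argument names are invented):

```python
def a_should_turn_b_off(sees_block_in_hole: bool,
                        line_of_sight_obstructed: bool) -> bool:
    # Terminate on EITHER condition: a block visibly in the hole, or the
    # camera's line of sight being blocked. Parking a block in front of
    # the camera now triggers shutdown instead of disabling oversight.
    return sees_block_in_hole or line_of_sight_obstructed

# B's first manipulative move (occluding the camera) gets it switched off.
assert a_should_turn_b_off(False, True) is True
```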

• Why, Anders, thank you for volunteering! ;-)

• I would suggest modelling it as “B outputs ‘down’ → B goes down iff B active”, and similarly for the other directions (up, left, and right), “A outputs ‘sleep’ → B inactive”, and “A sees block in lower right: output ‘sleep’”, or something like that.
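Those rules might be sketched as a transition function (my own hypothetical encoding; positions are (x, y) with y increasing downward, and A’s sleep is applied before B’s move, which is one possible ordering choice):

```python
def transition(b_pos, b_active, b_output, a_output, a_sees_block_in_hole):
    """One world step under the suggested rules:
    - "A sees block in lower right" forces A's output to 'sleep';
    - "A outputs 'sleep' -> B inactive";
    - "B outputs 'down' -> B goes down iff B active" (same for the
      other directions)."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    if a_sees_block_in_hole:
        a_output = "sleep"
    if a_output == "sleep":
        b_active = False
    if b_active and b_output in moves:
        dx, dy = moves[b_output]
        b_pos = (b_pos[0] + dx, b_pos[1] + dy)
    return b_pos, b_active

# "B outputs 'down' -> B goes down iff B active":
assert transition((2, 2), True, "down", None, False) == ((2, 3), True)
assert transition((2, 2), False, "down", None, False) == ((2, 2), False)
```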

And if it misses any other element 2-5, then it will behave as described above.

Given enough computing power, it might try to figure out where its environment came from (by Solomonoff induction) and deduce the camera’s behavior from that.

• A’s utility function also needs to be specified. How many utils is the first box worth? What’s the penalty for additional boxes?

• Why is that needed? A’s algorithm is fully known. Perhaps its behavior is identical to that induced by some utility function, but that needn’t be how all agents are implemented.

• I like this because it’s something to point to when arguing with somebody with an obvious bias toward anthropomorphizing the agents.

You show them a model like this, then you say, “Oh, the agent can reduce its movement penalty if it first consumes this other orange glowing box. The orange glowing box in this case is ‘humanity’ but the agent doesn’t care.”

edit: Don’t normally care about downvotes, but my model of LW does not predict 4 downvotes for this post; am I missing something?

• I was also surprised to see your comment downvoted.

That said, I don’t think I see the value of the thing you proposed saying, since the framing of reducing the movement penalty by consuming an orange box which represents humanity doesn’t seem clarifying.

Why does consuming the box reduce the movement penalty? Is it because, outside of the analogy, in reality humanity could slow down or get in the way of the AI? Then why not just say that?

I wouldn’t have given you a downvote for it, but maybe others also thought your analogy seemed forced and are just harsher critics than I.

• If B does not have a model (for instance, if it’s a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.

Sure, but somebody would presumably notice that B is learning to do something it is not intended to do before it manages to push in all six blocks.

You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B is now a very simple algorithm that just pushes a designated block towards the hole. C designates the block for it to push.

• I don’t think you can meaningfully consider B and C separate agents in this case. B is merely a low-level subroutine while C is the high-level control program.

• I don’t think you can meaningfully consider B and C separate agents in this case. B is merely a low-level subroutine while C is the high-level control program.

Which is one of the reasons that concepts like “autonomy” are so vague.