Plausibly, almost every powerful algorithm would be manipulative

I had an interesting debate recently about whether we could make smart AIs safe just by focusing on their structure and their task. Specifically, we were pondering something like:

  • “Would an algorithm be safe if it were a neural net-style image classifier, trained on examples of melanoma to detect skin cancer, with no other role than to output a probability estimate for a given picture? Even if ‘superintelligent’, could such an algorithm be an existential risk?”

Whether it’s an existential risk was not resolved; but I have a strong intuition that such an algorithm would likely be manipulative. Let’s see how.

The requirements for manipulation

For an algorithm to be manipulative, it has to derive some advantage from manipulation, and it needs to be able to learn to manipulate; for that, it needs to be able to explore situations where it engages in manipulation and where this is to its benefit.

There are certainly very simple situations where manipulation can emerge. But that kind of example, though simple, involves an agent that is active in the world. Can a classifier display the same sort of behaviour?

Manipulation emerges naturally

To show that it can, picture the following design. The programmers have a large collection of slightly different datasets, and want to train the algorithm on all of them. The loss function is an error rate, which can vary between 0 and 1. Many of the hyperparameters are set by a neural net, which itself takes a more “long-term view” of the error rate, trying to improve it from day to day rather than from run to run.

How have the programmers set up the system? Well, they run the algorithm on batched samples from ten datasets at once, and record the error rate for all ten. The hyperparameters are set to minimise average error over each run of ten. When the error rate on one dataset falls below a certain threshold for a few runs, they remove that dataset from the batches and substitute in a new one to train the algorithm on[1].
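
To make the setup concrete, here is a minimal sketch of that training protocol in Python. Every name, the threshold value, and the stub “training” function are illustrative assumptions of mine, not anything from a real system; the point is just the structure: ten active datasets, a meta-level judged on average error, and a rotation rule that swaps out any dataset whose error falls below the threshold.

```python
import random

# Illustrative assumptions: the threshold, the number of active slots,
# and the stub error model are made up for this sketch.
REMOVAL_THRESHOLD = 0.25
ACTIVE_SLOTS = 10

def train_and_evaluate(dataset_id, hyperparams):
    """Stub standing in for one run of the classifier on one dataset.

    Returns an error rate in [0, 1]; a real system would train the
    network with the given hyperparameters and measure held-out error.
    """
    return random.uniform(0.0, 1.0)

def one_run(active, reserve, hyperparams):
    """One run over the current batch of datasets, with dataset rotation."""
    errors = {ds: train_and_evaluate(ds, hyperparams) for ds in active}

    # This is what the hyperparameter-setting meta-level is judged on:
    average_error = sum(errors.values()) / len(errors)

    # Rotation rule: any dataset the classifier has "solved" (error below
    # the threshold) is removed and replaced by a fresh one. (The post says
    # "for a few runs"; a single run is used here for brevity.)
    for ds in list(active):
        if errors[ds] < REMOVAL_THRESHOLD and reserve:
            active.remove(ds)
            active.append(reserve.pop())

    return average_error

# Example usage: datasets are just labelled by number in this sketch.
active = list(range(ACTIVE_SLOTS))
reserve = list(range(ACTIVE_SLOTS, 100))
print(one_run(active, reserve, hyperparams={}))
```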

So, what will happen? Well, the system will initially start to minimise the error rate. Then, after a while, it will start “noticing” that when the error on a dataset drops below the removal threshold, that dataset is removed and the average error rate shoots up (the replacement dataset is new, so the error on it starts out high). Actually, it won’t be “noticing” this; instead, if there are hyperparameters that prevent the error rate going below the threshold, these will be selected by gradient descent as giving a better performance.

So, this image-classification optimisation problem will result in the algorithm manipulating its programmers, by “underperforming” on key datasets so that they stay in the batch.
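
A toy calculation shows the selection pressure at work. Suppose (with purely illustrative numbers I am assuming here, not taken from the post) that a mastered dataset could be driven down to very low error, that a freshly substituted dataset starts near 0.8 error, and that the other nine datasets sit around 0.3. Then hyperparameters that hold the mastered dataset just above the removal threshold score better on the meta-level objective, average error per run, than “honest” hyperparameters that let it be removed:

```python
# Illustrative numbers only, chosen to show the selection pressure.
THRESHOLD = 0.25        # assumed removal threshold
OTHER_NINE = [0.3] * 9  # assumed error rates on the other nine datasets

# "Honest" hyperparameters: the mastered dataset drops below the threshold,
# gets removed, and is replaced by a fresh dataset at ~0.8 error.
honest_next_run = sum(OTHER_NINE + [0.8]) / 10

# "Sandbagging" hyperparameters: hold that dataset's error just above
# the threshold, so it stays in the batch.
sandbag_next_run = sum(OTHER_NINE + [0.26]) / 10

print(f"honest:      {honest_next_run:.3f}")   # 0.350
print(f"sandbagging: {sandbag_next_run:.3f}")  # 0.296
```

Gradient descent at the meta-level doesn’t need to “understand” any of this; the sandbagging setting simply scores better on the objective it is actually given, so it gets kept.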

Generic manipulation

Of course, the manipulation above happened because the programmers didn’t understand what the algorithm’s true loss function was. They thought it was “minimise overall loss on classification”, but it was actually “keep each dataset’s loss just above the removal threshold”.

But that kind of mistake is easy to make. Through carelessness, or because the programmers thought it would improve the situation, the loss they thought they had specified diverged from the actual loss, and this divergence resulted in manipulative behaviour.

Here is another example: suppose the programmers build in a precaution for things going wrong. If the error rate suddenly doubles on some dataset, then the algorithm is run exclusively on that dataset for a while, until the error rate declines again (or, if it doesn’t, the programmers are called).

Now suppose that, of the ten datasets it has, nine have a moderately high error rate and one has a much lower one. Then it is in the algorithm’s interest to suddenly double the error rate on that last dataset: it will then be run for a time on that dataset alone, and as long as the doubled error is still below the average over all ten, this lowers the average error rate it is scored on. Thus hyperparameters that allow this kind of behaviour will get selected for (and if the algorithm is run on a lot of datasets, it will have many opportunities to learn about this kind of behaviour).
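
Again, a toy calculation (with assumed, illustrative numbers) makes the incentive concrete. Say the nine other datasets sit around 0.3 error and the easy one at 0.1. Doubling the easy one to 0.2 triggers the precaution, and for a while the algorithm is scored on that dataset alone:

```python
# Assumed, illustrative error rates.
others = [0.3] * 9
easy = 0.1

# Behave "honestly": scored on the average over all ten datasets.
honest_average = (sum(others) + easy) / 10   # 0.28

# Suddenly double the error on the easy dataset: the precaution kicks in
# and, for a while, the algorithm is scored only on that one dataset.
manipulated_average = 2 * easy               # 0.20

print(honest_average, manipulated_average)
```

So, as long as the doubled error still sits below the honest ten-dataset average, the precaution rewards the sudden spike.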

What about the “if it doesn’t, the programmers are called” clause? This depends on what happens when the programmers are called, and how the error rate is computed in that case. But, depending on what does happen, this incentivises the algorithm to learn to call the programmers often (double an error rate and keep it high) or rarely (if there’s a risk of the error rate for a dataset going up suddenly, then don’t lower it too much in the first place).

So, manipulative behaviour can emerge in smart algorithms, even if they’re merely classifying images in a way the programmers feel is entirely harmless.

Existential risks

Can this mild kind of manipulation lead to extreme, existential-risk outcomes? Of course it can; it’s not hard to create toy models where everything ends disastrously. But it’s also not hard to create toy models where manipulation remains mild.

But, in conclusion:

  • Manipulative behaviours can emerge in seemingly innocuous algorithms.

  • There is no clear division, currently, between mild manipulation and disastrous manipulation.


  1. Assume that either catastrophic forgetting has been solved, or that they’ll re-run the removed datasets occasionally, to refresh the algorithm’s performance on them. ↩︎