A first look at the hard problem of corrigibility

Summary: We would like to build corrigible AIs, which do not prevent us from shutting them down or changing their utility function. While there are some corrigibility solutions (such as utility indifference) that appear to partially work, they do not capture the philosophical intuition behind corrigibility: we want an agent that not only allows us to shut it down, but also desires for us to be able to shut it down if we want to. In this post, we look at a few models of utility function uncertainty and find that they do not solve the corrigibility problem.


Introduction

Eliezer describes the hard problem of corrigibility on Arbital:

On a human, intuitive level, it seems like there’s a central idea behind corrigibility that seems simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information that other cognitive system contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of ‘correction’.

Formalizing the hard problem of corrigibility seems like it might be a problem that is hard (hence the name). Preliminary research might talk about some obvious ways that we could model A as believing that B has some form of information that A’s preference framework designates as important, and showing what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

The objective of this post is to be some of the preliminary research described in the second paragraph.

Setup

We will assume that the AI exists in the same world as the human. We will examine various models the AI could use for the human and the true utility function. None of these models will truly yield a corrigible agent.

1. The human is a logically omniscient Bayesian utility maximizer who knows their utility function

a) The human is a black box

i) The human is aware of the AI

If the AI models the human as a black box Bayesian utility maximizer who knows about the AI, then it can assume that the human will communicate their utility function to the AI efficiently. This leads to a signalling equilibrium in which the human communicates the correct utility function to the AI using an optimal code. So the AI will assume that the human communicates the utility function, e.g. by writing it as a computer program.
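To make the idealized model concrete, here is a minimal Python sketch of an AI that takes a human-supplied utility program entirely at face value and maximizes it. The names (human_supplied_utility, ACTIONS) and the toy action space are made up for illustration; nothing here is meant as an actual proposal.

```python
# Toy sketch: the AI trusts a utility function the human wrote down as a program.
# All names and actions are illustrative placeholders.

def human_supplied_utility(action: str) -> float:
    """Stand-in for the program the idealized human writes down."""
    return {"cure_disease": 1.0, "make_paperclips": 0.1, "do_nothing": 0.0}[action]

ACTIONS = ["cure_disease", "make_paperclips", "do_nothing"]

def act(utility) -> str:
    # The AI takes the communicated utility function at face value
    # and picks the action that maximizes it.
    return max(ACTIONS, key=utility)

print(act(human_supplied_utility))  # -> cure_disease
```

The point of the sketch is only that, once the program has been handed over, the human plays no further role in the AI's decisions.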

Of course, in real life this will not work, because the human is unable to write down their utility function as a program.

ii) The human is not aware of the AI

If the human is not aware of the AI, then the AI must learn the human’s values by observing the human interacting with the world, rather than through signalling. Since this model assumes the human is perfectly rational, it is very close to value learning models used in economics. However, these models are inappropriate for corrigibility, because corrigibility requires the human to interact with the AI (e.g. by shutting it down). Additionally, the AI will want to manipulate the human into being an efficient morality sensor; for example, it may set up trolley problems for the human to encounter. This will not yield the right answer unless the value learning model is correct (which it isn’t, because humans are not logically omniscient Bayesian utility maximizers).
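As a rough illustration of the kind of value learning this model implies, here is a minimal Python sketch that infers the human's utility function from observed choices under the (false) assumption that the human always picks the optimal option. The candidate utility functions and observations are made up for illustration.

```python
# Toy sketch of value learning from observed behavior, assuming the human is a
# perfectly rational utility maximizer. Candidates and observations are
# illustrative placeholders.

CANDIDATE_UTILITIES = {
    "values_health":  lambda a: {"exercise": 1.0, "watch_tv": 0.0}[a],
    "values_leisure": lambda a: {"exercise": 0.0, "watch_tv": 1.0}[a],
}

def update(prior: dict, observed_choice: str, options: list) -> dict:
    """Keep only hypotheses under which the observed choice was optimal,
    since a perfectly rational human never picks a suboptimal option."""
    posterior = {}
    for name, u in CANDIDATE_UTILITIES.items():
        likelihood = 1.0 if observed_choice == max(options, key=u) else 0.0
        posterior[name] = prior[name] * likelihood
    total = sum(posterior.values())
    return {k: v / total for k, v in posterior.items()} if total else prior

prior = {"values_health": 0.5, "values_leisure": 0.5}
posterior = update(prior, "exercise", ["exercise", "watch_tv"])
print(posterior)  # all weight moves to "values_health"
```

Notice that the more informative the situations the human faces, the faster this update concentrates, which is why an AI running something like it would want to steer the human into maximally informative dilemmas.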

b) The human is not a black box

Here, the AI can possibly gain information about the human’s utility function faster than in the signalling equilibrium, by taking apart the human’s brain (literally or metaphorically). This will give a sufficiently powerful AI enough information to predict the human’s actions in many different possible situations. Therefore, the AI will not need further observations of the human. We expect this to be bad, because it requires the AI’s value learning algorithm to be correct from the start. Certainly, this does not count as corrigible behavior!

2. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function

Bayesian uncertainty about the utility function, with no way of learning about the utility function, will not change much. If there is no way to learn about one’s utility function, then (depending on one’s approach to normative uncertainty) the optimal action is to optimize a weighted average of the candidate utility functions, or something similar to this. So this situation is really the same as the case with a known utility function.
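As a sanity check on this claim, here is a minimal Python sketch (with made-up weights and candidate utility functions) showing that a fixed distribution over utility functions, with no way to learn more, collapses into a single weighted-average utility function.

```python
# Toy sketch: with a fixed prior over candidate utility functions and no way to
# learn, maximizing expected utility is just maximizing one weighted average.
# Weights, candidates, and actions are illustrative placeholders.

CANDIDATES = [
    (0.7, lambda a: {"build_parks": 1.0, "build_factories": 0.2}[a]),
    (0.3, lambda a: {"build_parks": 0.1, "build_factories": 1.0}[a]),
]
ACTIONS = ["build_parks", "build_factories"]

def mixture_utility(action: str) -> float:
    # The fixed prior collapses the uncertainty into one ordinary utility function.
    return sum(weight * u(action) for weight, u in CANDIDATES)

best = max(ACTIONS, key=mixture_utility)
print(best, mixture_utility(best))  # behaves exactly like a single known utility function
```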

3. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function and observes it over time

a) The human is a black box

i) The human is aware of the AI

As in 1ai, we get a signalling equilibrium. In this idealized model, instead of communicating the full utility function at the start, the human will communicate observations of the true utility function (i.e. moral intuitions) over time. So the AI will keep the human alive and use them as a morality sensor. Similar to 1ai, this fails in real life because it requires the human to formally specify their moral intuitions.
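Here is a minimal Python sketch of the resulting morality-sensor dynamic, under the assumption that the human's reported intuitions are noisy observations of the true utility function; the candidate utilities, situations, and noise model are all made up for illustration.

```python
# Toy sketch: the AI treats the human's reported moral intuitions as noisy
# observations of the true utility function and updates a posterior over
# candidate utilities as reports come in. All names are illustrative.

CANDIDATES = {
    "hedonism":     lambda s: {"lie_to_spare_feelings": 0.8, "tell_hard_truth": 0.4}[s],
    "honesty_norm": lambda s: {"lie_to_spare_feelings": 0.1, "tell_hard_truth": 0.9}[s],
}

def update(posterior: dict, situation: str, reported_value: float) -> dict:
    """One Bayesian update from a single reported intuition."""
    def likelihood(u):
        # Reports are modeled as the true value plus bounded noise.
        return max(0.0, 1.0 - abs(reported_value - u(situation)))
    new = {name: posterior[name] * likelihood(u) for name, u in CANDIDATES.items()}
    total = sum(new.values())
    return {name: p / total for name, p in new.items()}

posterior = {"hedonism": 0.5, "honesty_norm": 0.5}
for situation, report in [("tell_hard_truth", 0.85), ("lie_to_spare_feelings", 0.2)]:
    posterior = update(posterior, situation, report)
print(posterior)  # most of the weight ends up on "honesty_norm"
```

In this model the human matters to the AI only as a data source, which is the sense in which being kept alive as a morality sensor falls short of the corrigible behavior described in the introduction.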

ii) The human is not aware of the AI

As in 1aii, the AI will learn about the human’s values from the human’s interaction with the world. This is different from 1aii in that the AI cannot assume that the human makes consistent decisions over time (because the human learns more about the utility function over time). However, in practice it is similar: the AI will manipulate the human into being an optimal morality sensor, only it will do this differently to account for the fact that the human gains moral information over time.

b) The human is not a black box

As in 1b, the AI might more efficiently gain information about the utility function by taking apart the human’s brain. Then, it can predict the human’s actions in possible future situations. This includes predictions about what moral intuitions the human would communicate. Similar to 1b, this fails in real life because it requires the AI’s value learning algorithm to be correct.

4. One of the above, but the human also has uncertainty about mathematical statements

In this case the human solves the problem as before, except that they delegate mathematical uncertainty questions to the AI. For example, the human might write out their true utility function as a mathematical expression that contains difficult-to-compute numbers. This requires the AI to implement a solution to logical uncertainty, but even if we already had such a solution, this would still place unrealistic demands on the human (namely, reducing the value alignment problem to a mathematical problem).
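To illustrate the kind of delegation involved, here is a minimal Python sketch in which the human's written-down utility function contains a quantity the AI cannot yet compute, so the AI scores actions by its current credence over that quantity. The constant, the credences, and the actions are made up for illustration, and a real treatment of logical uncertainty would need to be far more principled.

```python
# Toy sketch: the human's stated utility function references a hard-to-compute
# constant, so the AI takes an expectation over its current credence about
# that constant. All quantities here are illustrative placeholders.

# The AI's subjective credence over possible values of the unknown constant.
credence_over_constant = {0: 0.4, 1: 0.6}

def stated_utility(action: str, constant: int) -> float:
    # The human's formal utility function, parameterized by the unknown constant.
    if constant == 0:
        return {"policy_a": 1.0, "policy_b": 0.3}[action]
    return {"policy_a": 0.2, "policy_b": 0.9}[action]

def expected_utility(action: str) -> float:
    return sum(p * stated_utility(action, c) for c, p in credence_over_constant.items())

best = max(["policy_a", "policy_b"], key=expected_utility)
print(best, expected_utility(best))  # -> policy_b 0.66
```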

Conclusion

This short overview of corrigibility models shows that simple uncertainty about the correct utility function is not sufficient for corrigibility. It is not clear what the correct solution to the hard problem of corrigibility is. Perhaps it will involve some model, like those in this post, in which humans are bounded in a specific way that causes them to desire corrigible AIs, or perhaps it will look completely different from these models.