Can corrigibility be learned safely?

EDIT: Please note that the way I use the word “corrigibility” in this post isn’t quite how Paul uses it. See this thread for clarification.

This is mostly a reply to Paul Christiano’s Universality and security amplification and assumes familiarity with that post as well as Paul’s AI alignment approach in general. See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.

Consider the translation example again as an analogy about corrigibility. Paul’s alignment approach depends on humans having a notion of “corrigibility” (roughly “being helpful to the user and keeping the user in control”) which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility this would be questions/tasks of a human-understandable nature and context about the user’s background and current situation.) This seems plausible because in order for a human’s notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and “translating” it into a series of smaller tasks.

In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn’t comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about “corrigibility”.
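To make the structure of this worry concrete, here is a toy sketch (my own illustration, not Paul’s actual scheme, and with made-up placeholder functions) of an amplification-style answering process in which the overseer only ever sees queries below some comprehension limit. Any knowledge the human could only bring to bear on a whole, large input simply never gets consulted:

```python
# Toy sketch, not Paul's actual scheme: an amplification-style answerer whose
# overseer can only comprehend queries up to a fixed size and must decompose
# anything larger. Tacit knowledge (like a human's sense of corrigibility)
# that only applies to a task considered as a whole is never invoked, because
# no whole task ever reaches the overseer once it exceeds the limit.

MAX_QUERY_WORDS = 20  # stand-in for "queries small enough to comprehend/vet"


def overseer_answer(query: str) -> str:
    """Placeholder for a human answering a small, fully-comprehensible query."""
    return f"<answer to: {query!r}>"


def decompose(query: str) -> list[str]:
    """Placeholder for breaking a task into smaller subtasks."""
    words = query.split()
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]


def amplified_answer(query: str) -> str:
    if len(query.split()) <= MAX_QUERY_WORDS:
        # Small enough for the (impoverished) overseer to handle directly.
        return overseer_answer(query)
    # Otherwise, recursively break the task down and recombine the pieces.
    # Knowledge that only activates on the whole task is lost at this step.
    return " | ".join(amplified_answer(sub) for sub in decompose(query))
```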

In addition to “corrigibility” (trying to be helpful), breaking down a task also involves “understanding” (figuring out what the intended meaning of the request is) and “competence” (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can’t fully apply if they are not able to consider a task as a whole.

Paul is aware of this problem, at least with regard to competence, and his proposed solution is:

I propose to go on breaking tasks down anyway. This means that we will lose certain abilities as we apply amplification. [...] Effectively, this proposal replaces our original human overseer with an impoverished overseer, who is only able to respond to the billion most common queries.

How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.

With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. (See here for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.

With regard to corrigibility, if I try to think about what I’m doing when I’m trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user’s morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.
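Purely as a toy sketch of this informal procedure (with hypothetical placeholder functions, not a claim about how it would actually be implemented), it looks something like:

```python
# Toy sketch of the informal procedure described above. Every helper is a
# hypothetical placeholder: in a human, these steps draw on a prior about
# humans, modeling skill, and balancing judgment that are largely
# introspectively inaccessible, which is the point being made in the text.

from dataclasses import dataclass, field


@dataclass
class UserModel:
    inferred_values: dict = field(default_factory=dict)  # user's morals/values
    background: dict = field(default_factory=dict)       # user's situation


def model_user(context: dict, prior_about_humans: dict) -> UserModel:
    # Placeholder: a real version would update the prior on all available info.
    return UserModel(inferred_values=dict(prior_about_humans),
                     background=dict(context))


def interpret(request: str, user: UserModel) -> str:
    # Placeholder: use the user model to disambiguate the request's meaning.
    return request


def score(action: str, intended_meaning: str, user: UserModel) -> float:
    # Placeholder weights. How to balance (1) satisfying the request as given,
    # (2) upholding the user's values, and (3) keeping the user in control is
    # exactly the judgment that is hard to make introspectively accessible.
    satisfies_request = 1.0 if intended_meaning in action else 0.0
    upholds_values = 0.0  # would check the action against user.inferred_values
    keeps_user_in_control = 1.0 if action.startswith("check with user") else 0.0
    return satisfies_request + upholds_values + 2.0 * keeps_user_in_control


def corrigible_breakdown(request: str, context: dict, prior: dict) -> str:
    user = model_user(context, prior)
    intended = interpret(request, user)
    candidates = [intended, f"check with user, then: {intended}"]
    return max(candidates, key=lambda a: score(a, intended, user))
```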

So if we try to learn understanding and corrigibility “safely” (i.e., in small chunks), we end up with an overly literal overseer that lacks common sense understanding of language and independent judgment of what the user’s wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor “native” abilities. As Paul explains with regard to translation:

This is potentially OK, as long as we learn a good policy for leveraging the information in the environment (including human expertise). This can then be distilled into a state maintained by the agent, which can be as expressive as whatever state the agent might have learned. Leveraging external facts requires making a tradeoff between the benefits and risks, so we haven’t eliminated the problem, but we’ve potentially isolated it from the problem of training our agent.

So instead of directly trying to break down a task, the AI would first learn to understand natural language and what “being helpful” and “keeping the user in control” involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents are contributing little besides the ability to execute literal-minded tasks that don’t require independent judgment, it’s unclear what advantages there are to doing this as an Amplified agent as opposed to using ML directly to learn these things. And second, trying to learn understanding and corrigibility from external humans has the same problem as trying to learn from the human Overseer: if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you try to learn in small chunks, you won’t get all the information that you need.

The conclusion here seems to be that corrigibility can’t be learned safely, at least not in a way that’s clear to me.