What is ambitious value learning?

I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of defining the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing behavior that we want to see—that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us, in theory, to compute the behavior that an AI system would take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.

We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.

I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although typically rewards in RL depend only on states.
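As a toy illustration of this model (all names, dynamics, and the particular utility function below are hypothetical, not from the post), the agent treats its utility function as an explicit object that scores entire histories, not just final states, and chooses the plan whose expected utility is highest:

```python
import itertools

ACTIONS = ["left", "right"]
HORIZON = 3

def transition_probs(state, action):
    # Deterministic toy dynamics: the state tracks net rightward moves.
    return {state + (1 if action == "right" else -1): 1.0}

def utility(history):
    # A history-dependent utility: count how often the action changes.
    # A state-only reward could not express this preference.
    actions = [a for (a, _) in history]
    return sum(1 for a, b in zip(actions, actions[1:]) if a != b)

def expected_utility(state, plan):
    # Enumerate every stochastic outcome of following `plan` from `state`,
    # weighting each resulting history by its probability.
    histories = [((), state, 1.0)]
    for action in plan:
        new = []
        for hist, s, p in histories:
            for s2, q in transition_probs(s, action).items():
                new.append((hist + ((action, s2),), s2, p * q))
        histories = new
    return sum(p * utility(h) for h, _, p in histories)

# The agent maximizes the explicit utility function over all plans.
best_plan = max(itertools.product(ACTIONS, repeat=HORIZON),
                key=lambda plan: expected_utility(0, plan))
print(best_plan)  # → ('left', 'right', 'left')
```

The brute-force enumeration is exactly the kind of "hopelessly impractical scheme" allowed here: with infinite compute, an explicit utility function over histories fully pins down the behavior.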

A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in specifications can lead to bad results. However, machine learning is particularly good at narrowing down on the correct hypothesis among a vast space of possibilities using data, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we learn an explicit utility function from human behavior for the AI to maximize.

This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious—it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have done themselves.
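The inference pattern behind IRL-style value learning can be sketched in a toy form (all names, options, and data below are made up, and this is a bare Boltzmann-rationality example, not a real IRL algorithm such as MaxEnt IRL): score candidate utility functions by how likely a noisily rational human would be to produce the observed behavior, and keep the best-scoring candidate.

```python
import math

OPTIONS = ["apple", "cake", "salad"]
observed_choices = ["salad", "salad", "apple", "salad"]  # hypothetical data

# Two hypothetical candidate utility functions over the options.
candidate_utilities = {
    "likes_sugar":  {"apple": 1.0, "cake": 2.0, "salad": 0.0},
    "likes_health": {"apple": 1.0, "cake": 0.0, "salad": 2.0},
}

def choice_log_likelihood(utility, choices, beta=2.0):
    # Boltzmann-rational choice model: P(choice) is proportional to
    # exp(beta * utility(choice)), so higher-utility options are chosen
    # more often but not always (the human is noisily rational).
    log_z = math.log(sum(math.exp(beta * utility[o]) for o in OPTIONS))
    return sum(beta * utility[c] - log_z for c in choices)

# Keep the candidate utility function that best explains the behavior.
best = max(candidate_utilities,
           key=lambda name: choice_log_likelihood(candidate_utilities[name],
                                                  observed_choices))
print(best)  # → likes_health
```

Once such a utility function is inferred, an AI that optimizes it more capably than the demonstrator could in principle exceed human performance, which is what separates ambitious value learning from plain imitation.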

It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning—surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions—for example, I’m assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:

  • Attempting to use the utility function to choose actions before it has converged

  • Distributional shift causing the learned utility function to become invalid

  • Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly

The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).

The next post in the ‘Value Learning’ sequence, ‘The easy goal inference problem is still hard’ by Paul Christiano, will come out on Saturday November 3rd.

Tomorrow’s AI Alignment Forum sequences post will be ‘Embedded World-Models’, in the sequence ‘Embedded Agency’.