What is ambitious value learning?

I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of *defining* the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing behavior that we want to see—that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us in theory to compute the behavior that an AI system should take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.

We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.
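To make this concrete, one illustrative way to write the model down (the notation here is mine, chosen only for exposition) is

$$\pi^* = \arg\max_{\pi} \ \mathbb{E}_{h \sim P(h \mid \pi)}\big[U(h)\big],$$

where $h$ is a full history of observations and actions, $U$ is the explicitly represented utility function, and $P(h \mid \pi)$ is the distribution over histories induced by running the policy $\pi$ in the environment.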

I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although typically rewards in RL depend only on states.
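In the illustrative notation above, the standard RL objective is the special case in which the utility of a history decomposes into a sum of per-state rewards, $U(h) = \sum_t r(s_t)$, whereas here $U$ is allowed to depend on the entire history $h$.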

A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in specifications can lead to bad results. However, machine learning is particularly good at narrowing in on the correct hypothesis among a vast space of possibilities using data, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we learn an explicit utility function from human behavior for the AI to maximize.

This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious—it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have done themselves.
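As a rough illustration of the kind of inference involved, here is a minimal sketch of IRL-style reward learning, assuming a toy Boltzmann-rational model of the human and a linear reward on features. All names are made up for this example; it is not anyone’s proposed algorithm, just a picture of “fit a reward to human choices, then optimize it harder than the human does”.

```python
import numpy as np

# Toy setup: each decision offers NUM_ACTIONS actions, each described by a
# feature vector. Assume (for illustration) the human is Boltzmann-rational:
# P(action) is proportional to exp(theta . features(action)).
rng = np.random.default_rng(0)
NUM_FEATURES = 3
NUM_ACTIONS = 5

def boltzmann_logits(theta, action_features):
    """Unnormalized log-probabilities of each action under reward parameters theta."""
    return action_features @ theta

def fit_reward(demos, steps=2000, lr=0.1):
    """Maximum-likelihood estimate of theta by plain gradient ascent."""
    theta = np.zeros(NUM_FEATURES)
    for _ in range(steps):
        grad = np.zeros(NUM_FEATURES)
        for action_features, chosen in demos:
            logits = boltzmann_logits(theta, action_features)
            probs = np.exp(logits - np.max(logits))
            probs /= probs.sum()
            # Gradient of log P(chosen): features(chosen) - expected features
            grad += action_features[chosen] - probs @ action_features
        theta += lr * grad / len(demos)
    return theta

# Generate fake "human" demonstrations from a hidden true theta.
true_theta = np.array([1.0, -2.0, 0.5])
demos = []
for _ in range(200):
    action_features = rng.normal(size=(NUM_ACTIONS, NUM_FEATURES))
    logits = boltzmann_logits(true_theta, action_features)
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    chosen = rng.choice(NUM_ACTIONS, p=probs)  # noisy, human-like choice
    demos.append((action_features, chosen))

learned_theta = fit_reward(demos)

# The "AI" optimizes the learned reward exactly on a new decision, so it can
# exceed the noisy human's expected performance under the true reward.
new_features = rng.normal(size=(NUM_ACTIONS, NUM_FEATURES))
ai_choice = int(np.argmax(boltzmann_logits(learned_theta, new_features)))
print("learned theta:", learned_theta)
print("AI picks action", ai_choice)
```

The point of the sketch is only the shape of the pipeline: the human data is noisy and suboptimal, the learned utility function is explicit, and the AI then maximizes it directly rather than imitating the human.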

It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning—surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions—for example, I’m assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:

  • Attempting to use the utility function to choose actions before it has converged

  • Distributional shift causing the learned utility function to become invalid

  • Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly

The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).