What is narrow value learning?

Ambitious value learning aims to achieve superhuman performance by figuring out the underlying latent “values” that humans have, and evaluating new situations according to these values. In other words, it is trying to infer the criteria by which we judge situations to be good. This is particularly hard because in novel situations that humans haven’t seen yet, we haven’t even developed the criteria by which we would evaluate them. (This is one of the reasons why we need to model humans as suboptimal, which causes problems.)

Instead of this, we can use narrow value learning, which produces behavior that we want in some narrow domain, without expecting generalization to novel circumstances. The simplest form of this is imitation learning, where the AI system simply tries to imitate the supervisor’s behavior. This limits the AI’s performance to that of its supervisor. We could also learn from preferences over behavior, which can scale to superhuman performance, since the supervisor can often evaluate whether a particular behavior meets our preferences even if she can’t perform it herself. We could also teach our AI systems to perform tasks that we would not want to do ourselves, such as handling hot objects.
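
To make “learning from preferences over behavior” concrete, here is a minimal sketch of fitting a reward model from pairwise trajectory comparisons, in the spirit of Bradley-Terry-style reward modeling. The featurizer, the linear reward model, and names like `train_reward_model` are illustrative assumptions, not any particular system’s API:

```python
# Minimal sketch of learning a reward model from preferences over trajectories.
# Assumption: the supervisor compares pairs of trajectories; we fit weights w so
# that the preferred trajectory gets higher predicted return (Bradley-Terry model).
import numpy as np

rng = np.random.default_rng(0)

def trajectory_features(traj):
    # Stand-in featurizer: sum per-step feature vectors over the trajectory.
    return np.sum(traj, axis=0)

def predicted_return(w, traj):
    return trajectory_features(traj) @ w

def train_reward_model(preferences, feat_dim, lr=0.1, steps=500):
    """preferences: list of (preferred_traj, other_traj) pairs."""
    w = np.zeros(feat_dim)
    for _ in range(steps):
        grad = np.zeros(feat_dim)
        for better, worse in preferences:
            diff = predicted_return(w, better) - predicted_return(w, worse)
            p = 1.0 / (1.0 + np.exp(-diff))  # P(preferred trajectory wins | w)
            # Gradient of the log-likelihood of the observed preference.
            grad += (1.0 - p) * (trajectory_features(better) - trajectory_features(worse))
        w += lr * grad / len(preferences)
    return w

# Toy data: each trajectory is a (timesteps, features) array; the supervisor
# prefers whichever trajectory accumulates more of feature 0.
def make_traj():
    return rng.normal(size=(10, 3))

pairs = []
for _ in range(50):
    a, b = make_traj(), make_traj()
    better, worse = (a, b) if a[:, 0].sum() > b[:, 0].sum() else (b, a)
    pairs.append((better, worse))

w = train_reward_model(pairs, feat_dim=3)
print("learned reward weights:", w)  # weight on feature 0 should be largest
```

The key property is that the supervisor only has to compare trajectories, not produce them, which is what lets this kind of narrow value learning exceed the supervisor’s own ability to act.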

Nearly all of the work on preference learning, including most work on inverse reinforcement learning (IRL), is aimed at narrow value learning. IRL is often explicitly stated to be a technique for imitation learning, and early algorithms phrase the problem as matching the features in the demonstration, not exceeding them. The few algorithms that try to generalize to different test distributions, such as AIRL, are only aiming for relatively small amounts of generalization.
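
As a concrete illustration of the “matching the features” framing, here is a minimal sketch of the quantity apprenticeship-learning-style IRL algorithms try to match: the demonstrator’s discounted feature expectations. The trajectory representation and function names are illustrative assumptions:

```python
# Sketch of feature-expectation matching. Assumption: a linear reward
# R(s) = w . phi(s), so a policy whose discounted feature expectations match the
# expert's achieves the same return under the (unknown) true w.
import numpy as np

def feature_expectations(trajectories, gamma=0.99):
    """Discounted average of per-step features phi(s_t) over sampled trajectories."""
    mus = []
    for traj in trajectories:  # traj: (timesteps, features) array of phi(s_t)
        discounts = gamma ** np.arange(len(traj))
        mus.append((discounts[:, None] * traj).sum(axis=0))
    return np.mean(mus, axis=0)

def feature_gap(expert_trajs, policy_trajs):
    # The objective is to drive this gap toward zero: matching the
    # demonstrations, not exceeding them.
    return np.linalg.norm(feature_expectations(expert_trajs)
                          - feature_expectations(policy_trajs))

# Toy usage with random placeholder trajectories.
rng = np.random.default_rng(0)
expert = [rng.normal(size=(20, 4)) for _ in range(10)]
policy = [rng.normal(size=(20, 4)) for _ in range(10)]
print("feature gap:", feature_gap(expert, policy))
```

Because the target is to close this gap rather than to outperform the demonstrator, the resulting behavior is bounded by what was demonstrated, which is why this counts as narrow value learning.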

(Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took? The hope is that IRL gives you a good inductive bias for imitation, allowing you to be more sample efficient and to generalize a little bit.)
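
For contrast, here is a minimal behavioral cloning sketch: the demonstrator’s (state, action) pairs are treated as a supervised dataset and no reward is inferred at all. The toy data and the logistic-regression policy are illustrative assumptions:

```python
# Minimal behavioral cloning sketch: fit a classifier from states to the
# demonstrator's actions. Nothing is inferred about *why* the demonstrator acted,
# so there is no inductive bias beyond the function approximator itself.
import numpy as np

rng = np.random.default_rng(0)

def fit_behavioral_clone(states, actions, n_actions, lr=0.5, steps=300):
    """Multinomial logistic regression: logits = states @ W."""
    W = np.zeros((states.shape[1], n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(steps):
        logits = states @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient ascent on the log-likelihood of the demonstrated actions.
        W += lr * states.T @ (onehot - probs) / len(states)
    return W

def act(W, state):
    return int(np.argmax(state @ W))

# Toy demonstrations: the demonstrator picks action 1 when feature 0 is positive.
states = rng.normal(size=(200, 4))
actions = (states[:, 0] > 0).astype(int)
W = fit_behavioral_clone(states, actions, n_actions=2)
print(act(W, np.array([1.0, 0, 0, 0])),
      act(W, np.array([-1.0, 0, 0, 0])))  # should print 1 then 0
```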

You might have noticed that I talk about narrow value learning in terms of actual observed behavior from the AI system, as opposed to any sort of “preferences” or “values” that are inferred. This is because I want to include approaches like imitation learning, or meta learning for quick task identification and performance. These approaches can produce behavior that we want without having an explicit representation of “preferences”. In practice any method that scales to human intelligence is going to have to infer preferences, though perhaps implicitly.

Since any instance of narrow value learning is defined with respect to some domain or input distribution on which it gives sensible results, we can rank these instances according to how general their input distributions are. An algorithm that figures out what food I like to eat is very domain-specific, whereas one that determines my life goals and successfully helps me achieve them in both the long and short term is very general. When the input distribution is “all possible inputs”, we have a system that has good behavior everywhere, reminiscent of ambitious value learning.

(Annoyingly, I defined ambitious value learning to be about the definition of optimal behavior, such as an inferred utility function, while narrow value learning is about the observed behavior. So really the most general version of narrow value learning is equivalent to “ambitious value learning plus some method of actually obtaining the defined behavior in practice, such as by using deep RL”.)


Tomorrow’s AI Alignment Forum sequences post will be ‘Directions for AI Alignment’ by Paul Christiano in the sequence on iterated amplification.

The next post in this sequence will be ‘Ambitious vs. narrow value learning’ by Paul Christiano, on Friday 11th January.