Future directions for narrow value learning

Narrow value learning is a huge field that people are already working on (though not by that name) and I can’t possibly do it justice. This post is primarily a list of things that I think are important and interesting, rather than an exhaustive list of directions to pursue. (In contrast, the corresponding post for ambitious value learning did aim to be exhaustive, and I don’t think I missed much work there.)

You might think that since so many people are already working on narrow value learning, we should focus on more neglected areas of AI safety. However, I still think it’s worth working on, because long-term safety suggests a particular subset of problems to focus on, and that subset seems quite neglected.

For example, a lot of work is about how to improve current algorithms in a particular domain, and the solutions encode domain knowledge in order to succeed. This seems not very relevant to long-term concerns. Some work assumes that a handcoded featurization is given (so that the true reward is linear in the features); this is not an assumption we could make for more powerful AI systems.

I will speculate a bit on the neglectedness and feasibility of each of these areas, since for many of them there isn’t a person or research group championing them to whom I could defer about the arguments for success.

The big picture

This category of research is about how you could take narrow value learning algorithms and use them to create an aligned AI system. Typically, I expect this to work by having the narrow value learning enable some form of corrigibility.

As far as I can tell, nobody outside of the AI safety community works on this problem. While it is far too early to stake a confident position one way or the other, I am slightly less optimistic about this avenue of approach than one in which we create a system that is directly trained to be corrigible.

Avoiding problems with goal-directedness. How do we put together narrow value learning techniques in a way that doesn’t lead to the AI behaving like a goal-directed agent at each point? This is the problem with keeping a reward estimate that is updated over time. While reward uncertainty can help avoid some of the problems, it does not seem sufficient by itself. Are there other ideas that can help?
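
As a toy illustration of the kind of thing reward uncertainty buys you (a made-up sketch, not any particular algorithm from the literature): an agent that maintains a distribution over reward hypotheses can defer to the human when its belief is too spread out, instead of optimizing its current best guess.

```python
import numpy as np

# Hypothetical setup: three candidate reward functions over four outcomes.
# Each row is one reward hypothesis; each column is an outcome the agent can choose.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.5],
])
posterior = np.array([0.4, 0.35, 0.25])  # current belief over the hypotheses

def act(posterior, reward_hypotheses, confidence_threshold=0.9):
    """Optimize expected reward only when one hypothesis clearly dominates;
    otherwise defer to the human rather than acting like a goal-directed agent."""
    if posterior.max() >= confidence_threshold:
        expected_reward = posterior @ reward_hypotheses
        return ("act", int(np.argmax(expected_reward)))
    return ("ask_human", None)

print(act(posterior, reward_hypotheses))  # -> ('ask_human', None)
```

Even this is not sufficient by itself: once the posterior concentrates, the agent above reverts to goal-directed optimization of whatever reward it has settled on.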

Dealing with the difficulty of “human values”. Cooperative IRL makes the unrealistic assumption that the human knows her reward function exactly. How can we make narrow value learning systems that deal with this issue? In particular, what prevents them from updating on our behavior that’s not in line with our “true values”, while still letting them update on other behavior? Perhaps we could make an AI system that is always uncertain about what the true reward is, but how does this mesh with epistemics, which suggest that you can get to arbitrarily high confidence given sufficient evidence?
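
To make the tension with epistemics concrete, here is a small illustrative calculation (the numbers are made up): a standard Bayesian learner whose model says the evidence slightly favors one reward hypothesis will, given enough observations, become arbitrarily confident in it.

```python
import numpy as np

# Two reward hypotheses; each observation of human behavior is (by assumption)
# slightly more likely under hypothesis 0 than under hypothesis 1.
likelihood_per_observation = np.array([0.6, 0.4])
prior = np.array([0.5, 0.5])

for n_observations in [1, 10, 100]:
    belief = prior * likelihood_per_observation ** n_observations
    belief /= belief.sum()
    print(n_observations, belief)
# 1 observation    -> roughly [0.60, 0.40]
# 10 observations  -> roughly [0.98, 0.02]
# 100 observations -> essentially [1.0, 0.0]: arbitrarily high confidence
```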

Human-AI interaction

This area of research aims to figure out how to create human-AI systems that successfully accomplish tasks. For sufficiently complex tasks and sufficiently powerful AI, this overlaps with the big-picture concerns above, but there are also areas to work on with subhuman AI with an eye towards more powerful systems.

Assumptions about the human. In any feedback system, the update that the AI makes on the human feedback depends on the assumption that the AI makes about the human. In Inverse Reward Design (IRD), the AI system assumes that the reward function provided by a human designer leads to near-optimal behavior in the training environment, but may be arbitrarily bad in other environments. In IRL, the typical assumption is that the demonstrations are created by a human behaving Boltzmann rationally, but recent research aims to also correct for any suboptimalities they might have, and so no longer assumes away the problem of systematic biases. (See also the discussion in Future directions for ambitious value learning.) In Cooperative IRL, the AI system assumes that the human models the AI system as approximately rational. COACH notes that when you ask a human to provide a reward signal, they provide a critique of current behavior rather than a reward signal that can be maximized.
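
For concreteness, here is what the Boltzmann-rationality assumption looks like as a likelihood model (a generic sketch, not the implementation from any specific paper): the probability of a demonstrated trajectory grows exponentially with its return under the candidate reward.

```python
import numpy as np

def boltzmann_likelihood(trajectory_returns, beta=1.0):
    """P(trajectory i | reward) is proportional to exp(beta * return_i).

    beta -> infinity models a perfectly optimal human, beta -> 0 a uniformly
    random one; systematic biases are not captured by this model at all.
    """
    logits = beta * np.asarray(trajectory_returns, dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Under this assumption, a demonstration with return 3.0 is judged much more
# likely than one with return 1.0, for the same candidate reward function.
print(boltzmann_likelihood([3.0, 1.0, 0.0], beta=2.0))  # ~[0.98, 0.018, 0.002]
```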

Can we weaken the assumptions that we have to make, or get rid of them altogether? Barring that, can we make our assumptions more realistic?

Managing interaction. How should the AI system manage its interaction with the human to learn best? This is the domain of active learning, which is far too large a field for me to summarize here. I’ll throw in a link to Active Inverse Reward Design, because I already talked about IRD and I helped write the active variant.
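
One generic way to manage the interaction (a sketch of the common expected-information-gain pattern from active learning, with made-up numbers, not the Active IRD algorithm itself): choose the query whose answer is expected to reduce uncertainty over the reward hypotheses the most.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def best_query(posterior, answer_likelihoods):
    """Pick the query with the highest expected reduction in entropy of the
    posterior over reward hypotheses (expected information gain).

    answer_likelihoods[q][a][h] = P(answer a to query q | reward hypothesis h)
    """
    gains = []
    for likelihood in answer_likelihoods:      # one entry per candidate query
        gain = entropy(posterior)
        for answer_probs in likelihood:        # one entry per possible answer
            joint = posterior * answer_probs
            p_answer = joint.sum()
            if p_answer > 0:
                gain -= p_answer * entropy(joint / p_answer)
        gains.append(gain)
    return int(np.argmax(gains))

# Two reward hypotheses and two candidate queries: the first query's answer is
# uninformative, the second separates the hypotheses cleanly.
posterior = np.array([0.5, 0.5])
queries = [
    [np.array([0.5, 0.5]), np.array([0.5, 0.5])],  # answer independent of hypothesis
    [np.array([0.9, 0.1]), np.array([0.1, 0.9])],  # answer reveals the hypothesis
]
print(best_query(posterior, queries))  # -> 1
```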

Human policy. The utility of a feedback system is going to depend strongly on the quality of the feedback given by the human. How do we train humans so that their feedback is most useful for the AI system? So far, most work is about how to adapt AI systems to understand humans better, but it seems likely there are also gains to be had by having humans adapt to AI systems.

Finding and using preference information

New sources of data. So far preferences are typically learned through demonstrations, comparisons, or rankings, but there are likely other useful ways to elicit preferences. Inverse Reward Design gets preferences from a stated proxy reward function. Another obvious source is what people say, but natural language is notoriously hard to work with, so not much work has been done on it so far, though there is some. (I’m pretty sure there’s a lot more in the NLP community that I’m not yet aware of.) We recently showed that there is even preference information in the state of the world that can be extracted.

Handling multiple sources of data. We could infer preferences from behavior, from speech, from given reward functions, from the state of the world, etc., but it seems quite likely that the inferred preferences would conflict with each other. What do you do in these cases? Is there a way to infer preferences simultaneously from all the sources of data such that the problem does not arise? (And if so, what is the algorithm implicitly doing in cases where different data sources pull in different directions?)

Acknowledging Human Preference Types to Support Value Learning talks about this problem and suggests some aggregation rules but doesn’t test them. Reward Learning from Narrated Demonstrations learns from both speech and demonstrations, but they are used as complements to each other, not as different sources for the same information that could conflict.

I’m particularly excited about this line of research: it seems like it hasn’t been explored yet and there are things that can be done, especially if you allow yourself to simply detect conflicts, present the conflict to the user, and then trust their answer. (Though this wouldn’t scale to superintelligent AI.)
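
A minimal sketch of that “detect conflicts and defer” idea (the data structures and numbers here are hypothetical, not an existing system): infer a preference estimate from each source separately, flag the options where the sources disagree by more than some tolerance, and present those to the user.

```python
# Hypothetical per-source preference estimates over the same set of options,
# e.g. inferred from demonstrations, from speech, and from the world state.
preferences = {
    "demonstrations": {"clean_desk": 0.90, "loud_music": 0.20},
    "speech":         {"clean_desk": 0.80, "loud_music": 0.70},
    "world_state":    {"clean_desk": 0.85, "loud_music": 0.25},
}

def find_conflicts(preferences, tolerance=0.3):
    """Return the options whose inferred values disagree across sources by
    more than `tolerance`; these get shown to the user rather than aggregated."""
    options = set().union(*(source.keys() for source in preferences.values()))
    conflicts = {}
    for option in options:
        values = [source[option] for source in preferences.values() if option in source]
        if max(values) - min(values) > tolerance:
            conflicts[option] = values
    return conflicts

print(find_conflicts(preferences))
# -> {'loud_music': [0.2, 0.7, 0.25]}: ask the user which source to trust here.
```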

Generalization. Current deep IRL algorithms (or deep anything algorithms) do not generalize well. How can we infer reward functions that transfer well to different environments? Adversarial IRL is an example of work pushing in this direction, but my understanding is that it has had limited success. I’m less optimistic about this avenue of research because it seems like function approximators in general do not extrapolate well. On the other hand, I and everyone else have the strong intuition that a reward function should take fewer bits to specify than the full policy, and so should be easier to infer. (Though not based on Kolmogorov complexity.)