Future directions for ambitious value learning

To recap the sequence so far:

  • Ambitious value learning aims to infer a utility function that is safe to maximize, by looking at human behavior.

  • However, since you only observe human behavior, you must be able to infer and account for the mistakes that humans make in order to exceed human performance. (If we don’t exceed human performance, it’s likely that we’ll use unsafe techniques that do exceed human performance, due to economic incentives.)

  • You might hope to infer both the mistake model (aka systematic human biases) and the utility function, and then throw away the mistake model and optimize the utility function. This cannot be done without additional assumptions.

  • One potential assumption you could use would be to codify a specific mistake model. However, humans are sufficiently complicated that any such model would be wrong, leading to model misspecification. Model misspecification causes many problems in general, and is particularly thorny for value learning.

Despite these arguments, we could still hope to infer a broad utility function that is safe to optimize, either by sidestepping the formalism used so far, or by introducing additional assumptions. Often, it is clear that these methods would not find the true human utility function (assuming that such a thing exists), but they are worth pursuing anyway because they could find a utility function that is good enough.

This post provides pointers to approaches that are currently being pursued. Since these are active areas of research, I don’t want to comment on how feasible they may or may not be; it’s hard to accurately assess the importance and quality of an idea that is being developed just from what is currently written down about that idea.

Assumptions about the mistake model. We could narrow down the mistake model by making assumptions about it, which could let us avoid the impossibility result. This decision means that we’re accepting the risk of misspecification, but perhaps as long as the mistake model is not too misspecified, the outcome will still be good.

Learning the Preferences of Ignorant, Inconsistent Agents shows how to infer utility functions when you have an exact mistake model, such as “the human is a hyperbolic time discounter”. (Learning the Preferences of Bounded Agents and the online book Modeling Agents with Probabilistic Programs cover similar ground.)
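
To make this concrete, here is a minimal sketch of the kind of inference these papers enable, not their actual models: we assume the human is a hyperbolic time discounter with a known discount rate and softmax-noisy choices, and infer a posterior over the utility of a delayed reward from observed choices. All parameters and numbers below are made up for illustration.

```python
import numpy as np

# Illustrative sketch only (not the papers' actual models): a codified mistake
# model ("hyperbolic time discounter with softmax-noisy choices") used to infer
# a utility from observed choices. All parameter values are assumptions.
K = 1.0                                   # assumed hyperbolic discount rate
DELAYS = np.array([0.0, 10.0])            # option 0 is immediate, option 1 is delayed
BETA = 2.0                                # assumed softmax (noise) temperature

def choice_probs(u_now, u_later):
    """Probability of choosing each option under the assumed mistake model."""
    discounted = np.array([u_now, u_later]) / (1.0 + K * DELAYS)
    exps = np.exp(BETA * (discounted - discounted.max()))
    return exps / exps.sum()

# Suppose the human picked the immediate option 7 times and the delayed option
# 3 times. Do grid-based Bayesian inference over the delayed option's utility
# (the immediate option's utility is fixed to 1 to set the scale).
grid = np.linspace(0.1, 5.0, 200)
log_post = np.array([7 * np.log(choice_probs(1.0, u)[0]) +
                     3 * np.log(choice_probs(1.0, u)[1]) for u in grid])
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
print("Posterior mean utility of the delayed option:", (grid * posterior).sum())
```

Because the mistake model is fixed in advance, the inferred utility can differ from what the raw choices suggest; of course, the conclusions are only as trustworthy as the assumed model.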

Inferring Reward Functions from Demonstrators with Unknown Biases takes this a step further by simultaneously learning the mistake model and the utility function, while making weaker assumptions on the mistake model than “the human is noisily optimal”. Of course, it does still make assumptions, or it would fall prey to the impossibility result (in particular, it would be likely to infer the negative of the “true” utility function).

The structure of the planning algorithm. Avoiding the impossibility result requires us to distinguish between (planner, reward) pairs that lead to the same policy. One approach is to look at the internal structure of the planner (this corresponds to looking inside the brains of individual humans). I like this post as an introduction, but many of Stuart Armstrong’s other posts are tackling some aspect of this problem. There is also work that aims to build a psychological model of what constitutes human values, and use that to infer values, described in more detail (with citations) in this comment.

Assumptions about the relation of behavior to preferences. One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily. Perhaps this is because we have built-in priors that relate observations of behavior to preferences, which we could impart to our AI systems. For example, we could encode the assumption that regret is bad, or that lying about values is similar to lying about facts.

From the perspective of the sequence so far, both things we say and things we do count as “human behavior”. But perhaps we could add an assumption that inferences from speech and inferences from actions should mostly agree, and have rules about what to do if they don’t. While there is a lot of work that uses natural language to guide some other learning process, I don’t know of any work that tries to resolve conflicts between speech and actions (or multimodal input more generally), but it’s something that I’m optimistic about. Acknowledging Human Preference Types to Support Value Learning explores this problem in more detail, suggesting some aggregation rules, but doesn’t test any of these rules on real problems.
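
As a purely hypothetical illustration of what such an aggregation rule could look like (not one of the rules from the paper above), here is a sketch that combines a reward posterior inferred from speech with one inferred from actions, and defers to the human when they sharply disagree; the hypotheses, probabilities, and threshold are all invented for the example.

```python
import numpy as np

# Hypothetical aggregation rule: combine reward posteriors inferred from speech
# and from demonstrations, and flag sharp disagreements rather than averaging
# them away. Everything here is made up for illustration.
reward_hypotheses = ["values tidiness", "values free time", "values money"]
p_speech  = np.array([0.7, 0.2, 0.1])    # made-up posterior from stated preferences
p_actions = np.array([0.1, 0.6, 0.3])    # made-up posterior from observed behavior

def aggregate(p_s, p_a, disagreement_threshold=0.5):
    """Product-of-experts combination, unless the two sources conflict too much."""
    conflict = 0.5 * np.abs(p_s - p_a).sum()   # total variation distance
    if conflict > disagreement_threshold:
        return None                            # ask the human instead of optimizing
    combined = p_s * p_a
    return combined / combined.sum()

result = aggregate(p_speech, p_actions)
print("Ask for clarification" if result is None else dict(zip(reward_hypotheses, result)))
```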

Other schemes for learning utility functions. One could imagine particular ways that value learning could go which would result in learning a good utility function. These cases typically can be recast as making some assumption about the mistake model.

For example, this comment proposes that the AI first asks humans how they would like their life to be while they figure out their utility function, and then uses that information to compute a distribution of “preferred” lives from which it learns the full utility function. The rest of the thread is a good example of applying the “mistake model” way of thinking to a proposal that does not obviously fit in its framework. There has been much more thinking in a similar vein spread across many posts and comment threads that I haven’t collected, but you might be able to find some of it by looking at discussions between Paul Christiano and Wei Dai.

Resolving human values, completely and adequately presents another framework that aims for an adequate utility function instead of a perfect one.

Besides the approaches above, which still seek to infer a single utility function, there are a few other related approaches:

Tolerating a mildly misspecified utility function. The ideas of satisficing and mild optimization aim to make us more robust to a misspecified utility function by reducing how much we optimize it. The key example of this is quantilizers, which select an action randomly from the top N% of actions from some distribution, sorted by expected utility.
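
A minimal sketch of the quantilizer idea, assuming a finite set of candidate actions and a uniform base distribution; the utility estimate and the choice of q below are placeholders.

```python
import numpy as np

# Minimal quantilizer sketch: instead of taking the argmax of a possibly
# misspecified utility estimate, sample uniformly from the top q fraction of
# actions (here with a uniform base distribution over a finite action set).
def quantilize(actions, utility_estimate, q=0.05, rng=None):
    rng = rng or np.random.default_rng()
    scores = np.array([utility_estimate(a) for a in actions])
    k = max(1, int(np.ceil(q * len(actions))))   # size of the top-q slice
    top_indices = np.argsort(scores)[-k:]        # indices of the best k actions
    return actions[int(rng.choice(top_indices))]

# Toy usage: 1000 candidate actions scored by a placeholder utility estimate.
actions = list(range(1000))
chosen = quantilize(actions, utility_estimate=lambda a: -(a - 500) ** 2, q=0.05)
print(chosen)   # an action near 500, but not necessarily the single best one
```

Sampling from the top slice, rather than taking the argmax, limits how hard the misspecified utility estimate gets optimized.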

Uncertainty over utility functions. Much work in value learning involves uncertainty over utility functions. This does not by itself fix the issues presented so far. To see why, consider what would happen if the AI updated on all possible information about the utility function: at that point, the AI would take the expectation of the resulting distribution and maximize that function. We once again end up with the AI optimizing a single function, and all of the same problems arise.
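
As a toy illustration (all outcomes, utility hypotheses, and weights below are made up): once the posterior is fixed, the expectation over utility functions is itself just one utility function, which the AI then maximizes like any other objective.

```python
import numpy as np

# Toy illustration: a posterior over candidate utility functions collapses into
# a single fixed objective once we take its expectation.
outcomes = ["A", "B", "C"]
candidate_utilities = [
    lambda o: {"A": 1.0, "B": 0.0, "C": 0.5}[o],   # one hypothesis about the true utility
    lambda o: {"A": 0.2, "B": 0.9, "C": 0.4}[o],   # another hypothesis
]
posterior = np.array([0.7, 0.3])                   # weights after updating on all evidence

def expected_utility(outcome):
    # This is itself just a single utility function over outcomes.
    return float(sum(p * u(outcome) for p, u in zip(posterior, candidate_utilities)))

best = max(outcomes, key=expected_utility)         # ordinary maximization of one function
print(best, expected_utility(best))
```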

To be clear, most researchers do not think that uncertainty is a solution to these problems; uncertainty can be helpful for other reasons, which I talk about later in the sequence. I mention this area of work because it uses the same framework of an AI optimizing a utility function, and because I suspect many people will automatically associate uncertainty with any kind of value learning, since CHAI has typically worked on both. However, uncertainty is typically not targeting the problem of learning a utility function that is safe to maximize.