The human side of interaction

The last few posts have motivated an analysis of the human-AI system rather than an AI system in isolation. So far we’ve looked at the notion that the AI system should get feedback from the user and that it could use reward uncertainty for corrigibility. These ideas focus on the AI system, but what about the human? If we build a system that explicitly solicits feedback from the human, what can we say about the human policy, and about how the human should provide feedback?

Interpreting human actions

One major free variable in any explicit interaction or feedback mechanism is what semantics the AI system should attach to the human feedback. The classic examples of AI risk are usually described in a way where this is the problem: when we provide a reward function that rewards paperclips, the AI system interprets it literally and maximizes paperclips, rather than interpreting it pragmatically as another human would.

(Aside: I suspect this was not the original point of the paperclip maximizer, but it has become a very popular retelling, so I’m using it anyway.)

Modeling this classic example as a human-AI system, we can see that the problem is that the human is offering a form of “feedback”, the reward function, and the AI system is not ascribing the correct semantics to it. The way it uses the reward function implies that the reward function encodes the optimal behavior of the AI system in all possible environments; a moment’s thought is sufficient to see that this is not actually the case. There will definitely be many cases and environments that the human did not consider when designing the reward function, and we should not expect that the reward function incentivizes the right behavior in those cases.

So what can the AI system assume if the human provides it a reward function? Inverse Reward Design (IRD) offers one answer: the human is likely to provide a particular reward function if it leads to high true utility behavior in the training environment. So, in the boat race example, if we are given the reward “maximize score” on a training environment where this actually leads to winning the race, then “maximize score” and “win the race” are about equally likely reward functions, since they would both lead to the same behavior in the training environment. Once the AI system is deployed on the environment in the blog post, it would notice that the two likely reward functions incentivize very different behavior. At that point, it could get more feedback from humans, or it could do something that is good according to both reward functions. The paper takes the latter approach, using risk-averse planning to optimize the worst-case behavior.
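To make this concrete, here is a toy sketch of the IRD inference and the risk-averse deployment step. Everything specific here is a hypothetical illustration, not the paper’s exact formulation: rewards are linear in two made-up features (points scored, race won), the likelihood simply scores how well the proxy’s optimal training behavior does under each candidate true reward (ignoring IRD’s normalization over possible proxies), and 0.1 is an arbitrary posterior cutoff.

```python
import math

def reward(w, f):
    """Linear reward: dot product of weight vector w and feature vector f."""
    return sum(wi * fi for wi, fi in zip(w, f))

def ird_posterior(candidates, proxy, train_trajs, beta=5.0):
    """P(w is the true reward | the designer wrote `proxy`): the designer
    probably chose a proxy whose optimal training-environment behavior
    scores well under the true reward w."""
    proxy_traj = max(train_trajs, key=lambda f: reward(proxy, f))
    weights = {w: math.exp(beta * reward(w, proxy_traj)) for w in candidates}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

def risk_averse_plan(trajs, posterior, threshold=0.1):
    """Pick the deployment trajectory that maximizes the worst-case reward
    over candidates with non-negligible posterior mass."""
    likely = [w for w, p in posterior.items() if p > threshold]
    return max(trajs, key=lambda f: min(reward(w, f) for w in likely))

w_score, w_win = (1.0, 0.0), (0.0, 1.0)   # "maximize score" vs "win the race"
train = [(1.0, 1.0), (0.0, 0.0)]          # in training, scoring and winning coincide
post = ird_posterior([w_score, w_win], proxy=w_score, train_trajs=train)
# Both candidates explain the proxy equally well, so the posterior is ~50/50.
deploy = [(2.0, 0.0), (1.0, 1.0)]         # loop collecting points, or actually win
choice = risk_averse_plan(deploy, post)   # picks (1.0, 1.0): decent under both rewards
```

Because scoring and winning coincide in training, the proxy cannot distinguish the two candidate rewards, and the risk-averse planner then avoids the point-farming trajectory that only one candidate endorses.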

Similarly, with inverse reinforcement learning (IRL), or learning from preferences, we need to make some sort of assumption about the semantics of the human demonstrations or preferences. A typical assumption is Boltzmann rationality: the human is assumed to take better actions with higher probability. This effectively models all human biases and suboptimalities as noise. There are papers that account for biases rather than modeling them as noise. A major argument against the feasibility of ambitious value learning is that any assumption we make will be misspecified, and so we cannot infer the “one true utility function”. However, it seems plausible that we could have an assumption that would allow us to learn some values (at least to the level that humans are able to).
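As a minimal sketch of the Boltzmann-rationality assumption: the probability of an action is proportional to the exponential of its value, so better actions are more likely but worse ones retain some probability mass. The action names and the β value here are hypothetical.

```python
import math

def boltzmann_policy(q_values, beta=1.0):
    """Boltzmann-rational action distribution: P(a) ∝ exp(beta * Q(a)).

    q_values: dict mapping action -> its value to the human.
    beta: rationality coefficient; beta = 0 is uniformly random,
          large beta approaches perfectly optimal behavior.
    """
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp(beta * (q - m)) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# Better actions get higher probability, but suboptimal ones keep some
# mass -- which is how this model folds all human biases into noise.
probs = boltzmann_policy({"good": 2.0, "ok": 1.0, "bad": 0.0}, beta=1.0)
```

An IRL algorithm inverts this model: given observed actions, it infers the values that would make those actions likely, and any systematic bias not captured by β gets absorbed as noise.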

The human policy

Another important aspect is how the human actually computes feedback. We could imagine training human overseers to provide feedback in the manner that the AI system expects. Currently we “train” AI researchers to provide reward functions that incentivize the right behavior in the AI systems. With IRD, we only need the human to extensively test their reward function in the training environment and make sure the resulting behavior is near optimal, without worrying too much about generalization to other environments. With IRL, the human needs to provide demonstrations that are optimal. And so on.

(Aside: This is very reminiscent of human-computer interaction, and indeed I think a useful frame is to view this as the problem of giving humans better, easier-to-use tools to control the behavior of the AI system. We started with direct programming, then improved upon that to reward functions, and are now trying to improve to comparisons, rankings, and demonstrations.)

We might also want to train humans to give more careful answers than they would have otherwise. For example, it seems really good if our AI systems learn to preserve option value in the face of uncertainty. We might want our overseers to think deeply about potential consequences, be risk-averse in their decision-making, and preserve option value with their choices, so that the AI system learns to do the same. (The details depend strongly on the particular narrow value learning algorithm: the best human policy for IRL will be very different from the best human policy for CIRL.) We might hope that this requirement only lasts for a short amount of time, after which our AI systems have learnt the relevant concepts sufficiently well that we can be a bit more lax in our feedback.

Learning human reasoning

So far I’ve been analyzing AI systems where the feedback is given explicitly, and there is a dedicated algorithm for handling the feedback. Does the analysis also apply to systems which get feedback implicitly, like iterated amplification and debate?

Well, certainly these methods will need to get feedback somehow, but they may not face the problem of ascribing semantics to the feedback, since they may have learned the semantics implicitly. For example, a sufficiently powerful imitation learning algorithm will be able to do narrow value learning simply because humans are capable of narrow value learning, even though it has no explicit assumption about the semantics of the feedback. Instead, it has internalized the semantics that we humans give to other humans’ speech.

Similarly, both iterated amplification and debate inherit the semantics from humans by learning how humans reason. So they do not have the problems listed above. Nevertheless, it is probably still valuable to train humans to be good overseers for other reasons. For example, in debate, the human judges are supposed to say which AI system provided the most true and useful information. It is crucial that the humans judge by this criterion, in order to provide the right incentives for the AI systems in the debate.


If we reify the interaction between the human and the AI system, then the AI system must make some assumption about the meaning of the human’s feedback. The human should also make sure to provide feedback that will be interpreted correctly by the AI system.