Human-AI Interaction

The importance of feedback

Consider trying to program a self-driving car to drive from San Francisco to Los Angeles, with no sensors that allow it to gather information as it is driving. This is possible in principle. If you can predict the exact weather conditions, the exact movement of all of the other cars on the road, the exact amount of friction along every part of the road surface, the exact impact of (the equivalents of) pressing the gas or turning the steering wheel, and so on, then you could compute ahead of time exactly how to control the car such that it gets from SF to LA. Nevertheless, it seems unlikely that we will ever be able to accomplish such a feat, even with powerful AI systems.

In practice, there is going to be some uncertainty about how the world will evolve, so any plan computed ahead of time will have errors that compound over the course of its execution. The solution is to use sensors to gather information while executing the plan, so that we can notice any deviations from the plan and take corrective action. It is much easier to build a controller that keeps you pointed in the general direction than to build a plan that gets you there perfectly without any adaptation.

Control theory studies these sorts of systems, and you can see the general power of feedback controllers in the theorems that can be proven. Especially for motion tasks, you can build feedback controllers that are guaranteed to safely achieve the goal, even in the presence of adversarial environmental forces (as long as they are bounded in size, so you can’t have arbitrarily strong wind). In the presence of an adversary, in most environments it becomes impossible even in principle to make such a guarantee if you have no sensors or feedback and must compute a plan in advance: typically, for every such plan, there is some environmental force that would cause it to fail.
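This contrast can be made concrete with a toy one-dimensional simulation. This is only an illustrative sketch, not real control theory: the "car", the disturbance model, and the controller gain are all made up for the example. An open-loop plan accumulates disturbances as a random walk, while a simple proportional feedback controller corrects each deviation as it is sensed.

```python
import random

def final_error(use_feedback, steps=100, gain=0.5, rng=None):
    """Drive a 1D 'car' toward position 0 under random disturbances.

    Open-loop: execute the precomputed plan with no corrections
    (we start exactly on the planned trajectory, so the plan says
    to do nothing). Closed-loop: a proportional controller senses
    the position each step and steers back toward the target.
    """
    rng = rng or random.Random(0)
    x = 0.0
    for _ in range(steps):
        disturbance = rng.uniform(-0.1, 0.1)   # wind, friction, etc.
        correction = -gain * x if use_feedback else 0.0
        x += disturbance + correction          # errors compound if uncorrected
    return abs(x)

def average_error(use_feedback, trials=200):
    """Average final distance from the target over many runs."""
    rng = random.Random(42)
    return sum(final_error(use_feedback, rng=rng) for _ in range(trials)) / trials

open_loop = average_error(use_feedback=False)
closed_loop = average_error(use_feedback=True)
# Averaged over many runs, the feedback controller ends far closer
# to the target than the open-loop plan does.
```

Note that the feedback controller needs no model of the disturbances at all; it only needs to sense where it currently is, which is exactly the point of the control theory perspective.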

The control theory perspective on AI alignment

With ambitious value learning, we’re hoping that we can learn a utility function that tells us the optimal thing to do far into the future. Such a utility function needs to encode exactly how to behave in all possible environments, no matter what new things happen in the future, even in situations that we humans have never considered a possibility so far.

This is analogous to the problem of trying to program a self-driving car. Just as in that case, we might hope that we can solve the problem by introducing sensors and feedback. In this case, the “feedback” would be human data that informs our AI system what we want it to do, that is, data that can be used to learn values. The evolution of human values and preferences in new environments with new technologies is analogous to the unpredictable environmental disturbances that control theory assumes.

This does not mean that an AI system must be architected in such a way that human data is explicitly used to “control” the AI every few timesteps in order to keep it on track. It does mean that any AI alignment proposal should have some method of incorporating information about what humans want in radically different circumstances. I have found this an important frame with which to view AI alignment proposals. For example, with indirect normativity or idealized humans, it’s important that the idealized or simulated humans go through experiences similar to those that real humans go through, so that they provide good feedback.

Feedback through interaction

Of course, while the control theory perspective does not require the feedback controller to be explicit, one good way to ensure that there is feedback would be to make it explicit. This would mean that we create an AI system that explicitly collects fresh data about what humans want in order to inform what it should do. This is basically calling for an AI system that is constantly using tools from narrow value learning to figure out what to do. In practice, this will require interaction between the AI and the human. However, there are still issues to think about:

Convergent instrumental subgoals: A simple way of implementing human-AI interaction would be to maintain an estimate of a reward function that is continually updated using narrow value learning. Whenever the AI needs to act, it chooses the action that looks best according to the current reward estimate.

With this sort of setup, we still have the problem that we are maximizing a reward function, which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward estimate from changing, so that all future actions continue to optimize the current estimate.
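A minimal sketch of this naive setup makes the problem easy to see (the class and method names here are hypothetical, purely for illustration): since the agent always maximizes its current reward estimate, any action that freezes the learning process guarantees that every future action keeps optimizing today's estimate.

```python
# Illustrative sketch of the naive interactive setup: a reward
# estimate continually updated from human feedback, with actions
# chosen greedily against the current estimate.

class NaiveInteractiveAgent:
    def __init__(self, actions):
        self.actions = actions
        # Current (tabular) estimate of the reward of each action.
        self.reward_estimate = {a: 0.0 for a in actions}
        self.learning_enabled = True

    def update_from_human(self, action, feedback, lr=0.1):
        """Narrow value learning step: nudge the estimate toward feedback."""
        if self.learning_enabled:
            est = self.reward_estimate[action]
            self.reward_estimate[action] = est + lr * (feedback - est)

    def act(self):
        """Choose greedily under the *current* reward estimate."""
        return max(self.actions, key=self.reward_estimate.get)

# The flaw: judged by the current estimate, an action that sets
# learning_enabled = False looks appealing, because it stops the
# estimate from ever changing, and so all future acts() continue
# to optimize exactly the estimate the agent holds right now.
```

The sketch uses a simple tabular estimate, but the incentive is the same for any architecture that separates "current reward estimate" from "maximize it": the update mechanism itself is an obstacle to the current objective.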

Another way of seeing that this setup is a bit weird is that it has inconsistent preferences over time: at any given point in time, it treats the expected change in its reward as an obstacle that should be undone if possible.

That said, it is worth noting that in this setup, the goal-directedness is coming from the human. In fact, any approach where goal-directedness comes from the human requires some form of human-AI interaction. We might hope that some system of this form allows us to have a human-AI system that is overall goal-directed (in order to achieve economic efficiency), while the AI system itself is not goal-directed, and so the overall system pursues the human’s instrumental subgoals. The next post will talk about reward uncertainty as a potential approach to get this behavior.

Humans are unable to give feedback: As our AI systems become more and more powerful, we might worry that they are able to vastly outthink us, such that they would need our feedback on scenarios that are too hard for us to comprehend.

On the one hand, if we’re actually in this scenario, I feel quite optimistic: if the remaining questions are so difficult that we can’t answer them, we’ve probably already solved all the simple parts of the reward, which means we’ve probably stopped x-risk.

But even if it is imperative that we answer these questions accurately, I’m still optimistic: as our AI systems become more powerful, we can have better AI-enabled tools that help us understand the questions on which we are supposed to give feedback. This could be AI systems that do cognitive work on our behalf, as in recursive reward modeling, or it could be AI-created technologies that make us more capable, such as brain enhancement or the ability to be uploaded and have bigger “brains” that can understand larger things.

Humans don’t know the goal: An important disanalogy between the control theory / self-driving car example and the AI alignment problem is that in control theory it is assumed that the general path to the destination is known, and we simply need to stay on it; whereas in AI alignment even the human does not know the goal (i.e. the “true human reward”). As a result, we cannot rely on humans to always provide adequate feedback; we also need to manage the process by which humans learn what they want. Concerns about human safety problems and manipulation fall into this bucket.


If I want an AI system that acts autonomously over a long period of time, but it isn’t doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system “on track” (since my instrumental values will change over that period of time).

While the feedback mechanism need not be explicit (and could arise simply because it is an effective way to actually help me), we could consider AI designs that have an explicit feedback mechanism. Such a design still has many problems, most notably that at any given point the obvious version looks like it could be goal-directed with a long-term reward function, which is the sort of system that we are most worried about.