Conclusion to the sequence on value learning

This post summarizes the sequence on value learning. While it doesn’t introduce any new ideas, it does shed light on which parts I would emphasize most, and the takeaways I hope that readers get. I make several strong claims here; interpret these as my impressions, not my beliefs. I would guess many researchers disagree with the (strength of the) claims, though I do not know what their arguments would be.

Over the last three months we’ve covered a lot of ground. It’s easy to lose sight of the overall picture over such a long period of time, so let’s do a brief recap.

The “obvious” approach

Here is an argument for the importance of AI safety:

  • Any agent that is much more intelligent than us should not be exploitable by us, since if we could find some way to exploit the agent, the agent could also find the exploit and patch it.

  • Anything that is not exploitable must be an expected utility maximizer; since we cannot exploit a superintelligent AI, it must look like an expected utility maximizer to us.

  • Due to Goodhart’s Law, even “slightly wrong” utility functions can lead to catastrophic outcomes when maximized (see the toy simulation after this list).

  • Our utility function is complex and fragile, so getting the “right” utility function is difficult.
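
To make the Goodhart’s Law step above a bit more concrete, here is a minimal toy simulation (my own illustration, not from the sequence): candidates are scored by a proxy equal to the true utility plus an independent error term, i.e. a “slightly wrong” utility function, and we pick the candidate the proxy likes best.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_optimization_gap(n_candidates: int, noise_scale: float = 1.0, trials: int = 500) -> float:
    """Average amount by which the proxy overestimates the true utility of the
    proxy-optimal candidate. The proxy is the true utility plus independent
    noise, i.e. a "slightly wrong" utility function."""
    gaps = []
    for _ in range(trials):
        true_utility = rng.normal(size=n_candidates)
        proxy = true_utility + noise_scale * rng.normal(size=n_candidates)
        best = int(np.argmax(proxy))          # optimize the slightly wrong proxy
        gaps.append(proxy[best] - true_utility[best])
    return float(np.mean(gaps))

for n in (10, 100, 1_000, 10_000):
    print(f"candidates={n:>6}   proxy minus true utility at the proxy-optimum: {proxy_optimization_gap(n):.2f}")
```

The gap grows as the search gets more thorough: stronger optimization of the proxy increasingly selects for the error in the proxy rather than for true utility. (Whether the outcome is actually catastrophic depends on the structure and tails of the error, which this toy model does not capture.)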

This argument implies that by the time we have a superintelligent AI system, there is only one part of that system that could still have been influenced by us: the utility function. Every other feature of the AI system is fixed by math. As a result, we must necessarily solve AI alignment by influencing the utility function.

So of course, the natural approach is to get the right utility function, or at least an adequate one, and have our AI system optimize that utility function. Besides fragility of value, which you might hope that machine learning could overcome, the big challenge is that even given full access to the entire human policy, we cannot infer the human’s values without making an assumption about how their preferences relate to their behavior. In addition, any misspecification can lead to bad inferences. And finally, the entire project of having a single utility function that captures optimal behavior in all possible environments seems quite hard to do: it seems necessary to have some sort of feedback from humans, or you end up extrapolating in some strange way that is not necessarily what we “would have” wanted.

So does this mean we’re doomed? Well, there are still some potential avenues for rescuing ambitious value learning, though they do look quite difficult to me. But I think we should actually question the assumptions underlying our original argument.

Problems with the standard argument

Consider the calculator. From the perspective of someone before the time of calculators, this device would look quite intelligent—just look at the speed with which it can do arithmetic! Nonetheless, we can all agree that a standard calculator is not dangerous.

It also seems strange to ascribe goals to the calculator—while this is not wrong per se, we certainly have better ways of predicting what a calculator will and will not do than by modelling it as an expected utility maximizer. If you model a calculator as aiming to achieve the goal of “give accurate math answers”, problems arise: what if I take a hammer to the calculator and then try to ask it 5 + 3? The utility maximizer model here would say that it answers 8, whereas with our understanding of how calculators work we know it probably won’t give any answer at all. Utility maximization with a simple utility function is only a good model for the calculator within a restricted set of environmental circumstances and a restricted action space. (For example, we don’t model the calculator as having access to the action “build armor that can protect against hammer attacks”, because otherwise utility maximization would predict it takes that action.)

Of course, it may be that something that is generally superintelligent will work in as broad a set of circumstances as we do, and will have as wide an action space as we do, and must still look to us like an expected utility maximizer since otherwise we could Dutch book it. However, if you take such a broad view, then it turns out that all behavior looks coherent. There’s no mathematical reason that an intelligent agent must have catastrophic behavior, since any behavior that you observe is consistent with the maximization of some utility function.
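
To see why, here is a sketch of the standard construction, written as hypothetical Python purely for illustration: given any policy whatsoever, we can write down a utility function over complete trajectories that the policy maximizes, simply by assigning utility 1 to exactly the trajectories the policy itself would produce.

```python
from typing import Callable, Sequence, Tuple

# A trajectory is the complete history of (observation, action) pairs.
Trajectory = Tuple[Tuple[str, str], ...]
# A policy maps the history so far and the current observation to an action.
Policy = Callable[[Sequence[Tuple[str, str]], str], str]

def rationalizing_utility(policy: Policy) -> Callable[[Trajectory], float]:
    """Return a utility function over trajectories that the given policy maximizes:
    utility 1 for any trajectory the policy itself would generate, 0 otherwise.
    Because this works for arbitrary (including visibly erratic) behavior,
    "it maximizes expected utility" by itself rules nothing out."""
    def utility(trajectory: Trajectory) -> float:
        history = []
        for observation, action in trajectory:
            if policy(history, observation) != action:
                return 0.0      # the policy would not have produced this trajectory
            history.append((observation, action))
        return 1.0
    return utility
```

Note that the constructed utility function is defined over entire trajectories: it merely encodes the behavior rather than explaining it, which is exactly why modeling an agent this way gives us no predictive power.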

To be clear, while I agree with every statement in Optimized agent appears coherent, I am making the strong claim that these statements are vacuous and by themselves tell us nothing about the systems that we will actually build. Typically, I do not flat out disagree with a common argument. I usually think that the argument is important and forms a piece of the picture, but that there are other arguments that push in other directions that might be more important. That’s not the case here: I am claiming that the argument that “superintelligent agents must be expected utility maximizers by virtue of coherence arguments” provides no useful information, with almost the force of a theorem. My uncertainty here is almost entirely caused by the fact that other smart people believe that this argument is important and relevant.

I am not claiming that we don’t need to worry about AI safety since AIs won’t be expected utility maximizers. First of all, you can model them as expected utility maximizers; it’s just not a useful model. Second, if we build an AI system whose internal reasoning consists of maximizing the expectation of some simple utility function, I think all of the classic concerns apply. Third, it does seem likely that humans will build AI systems that are “trying to pursue a goal”, and that can have all of the standard convergent instrumental subgoals. I propose that we describe these systems as goal-directed rather than as expected utility maximizers, since the latter term is vacuous and implies a level of formalization that we have not yet reached. However, this risk is significantly different. If you believed that superintelligent AI must be goal-directed because of math, then your only recourse for safety would be to make sure that the goal is good, which is what motivated us to study ambitious value learning. But if the argument is actually that AI will be goal-directed because humans will make it that way, you could instead try to design AI that is not goal-directed but can still do the things that goal-directed AI can do, and have humans build that.

Alternative solutions

Now that we aren’t forced to influence just a utility function, we can consider alternative designs for AI systems. For example, we can aim for corrigible behavior, where the agent is trying to do what we want. Or we could try to learn human norms, and create AI systems that follow these norms while trying to accomplish some task. Or we could try to create an AI ecosystem akin to Comprehensive AI Services, and set up the services such that they are keeping each other in check. We could create systems that learn how to do what we want in particular domains, by learning our instrumental goals and values, and use these as subsystems in AI systems that accelerate progress, enable better decision-making, and are generally corrigible. If we want to take such an approach, we have another source of influence: the human policy. We can train our human overseers to provide supervision in a particular way that leads to good behavior on the AI’s part. This is analogous to training operators of computer systems, and can benefit from insights from Human-Computer Interaction (HCI).

Not just value learning

This sequence is somewhat misnamed: while it is organized around value learning, there are many ideas that should be of interest to researchers working on other agendas as well. Many of the key ideas can be used to analyze any proposed solution for alignment (though the resulting analysis may not be very interesting).

The necessity of feedback. The main argument of Human-AI Interaction is that any proposed solution that aims to have an AI system (or a CAIS glob of services) produce good outcomes over the long term needs to continually use data about humans as feedback in order to “stay on target”. Here, “human” is shorthand for “something that we know shares our values”: e.g., idealized humans, uploads, or sufficiently good imitation learning would all probably count.

(If this point seems obvious to you, note that ambitious value learning does not clearly satisfy this criterion, and approaches like impact measures, mild optimization, and boxing are punting on this problem and aiming for not-catastrophic outcomes rather than good outcomes.)

Mistake models. We saw that ambitious value learning has the problem that even if we assume perfect information about the human, we cannot infer their values without making an assumption about how their preferences relate to their behavior. This is an example of a much broader pattern: given that our AI systems necessarily get feedback from us, they must be making some assumption about how to interpret that feedback. For any proposed solution to alignment, we should ask what assumptions the AI system is making about the feedback it gets from us.
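
As a concrete illustration of such an assumption, here is one commonly used mistake model, Boltzmann rationality, sketched in Python with hypothetical numbers: the human is assumed to pick actions with probability proportional to exp(β · Q). Under this assumption, observed choices become evidence about the underlying values; change the assumed β (or the mistake model entirely), and the very same behavior supports different inferences.

```python
import numpy as np

def boltzmann_action_probs(q_values: np.ndarray, beta: float) -> np.ndarray:
    """Boltzmann-rational mistake model: the human is assumed to choose action a
    with probability proportional to exp(beta * Q(s, a)). beta encodes how
    reliably the human's behavior is assumed to reflect their values."""
    logits = beta * q_values
    logits = logits - logits.max()        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical action values: the same observed choice is strong evidence about
# values under beta=5.0 (a near-optimal human) and weak evidence under beta=0.1
# (a very noisy human).
q = np.array([1.0, 0.8, 0.1])
print(boltzmann_action_probs(q, beta=5.0))
print(boltzmann_action_probs(q, beta=0.1))
```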