Research Agenda v0.9: Synthesising a human's preferences into a utility function

I'm now in a position where I can see a possible route to a safe/survivable/friendly Artificial Intelligence being developed. I'd give a 10+% chance of it being possible this way, and a 95% chance that some of these ideas will be very useful for other methods of alignment. So I thought I'd encode the route I'm seeing as a research agenda; this is the first public draft of it.

Clarity, rigour, and practicality: that's what this agenda needs. Writing this agenda has clarified a lot of points for me, to the extent that some of it now seems, in retrospect, just obvious and somewhat trivial—"of course that's the way you have to do X". But more clarification is needed in the areas that remain vague. And, once these are clarified enough for humans to understand, they need to be made mathematically and logically rigorous—and ultimately, cashed out into code, and tested and experimented with.

So I'd appreciate any comments that could help with these three goals, and welcome anyone interested in pursuing research along these lines over the long term.

0 The fundamental idea

This agenda fits itself into the broad family of Inverse Reinforcement Learning: delegating most of the task of inferring human preferences to the AI itself. Most of the task, since it's been shown that humans need to build the right assumptions into the AI, or else the preference learning will fail.

To get these "right assumptions", this agenda will look into what preferences actually are, and how they may be combined together. There are hence four parts to the research agenda:

  1. A way of identifying the (partial[1]) preferences of a given human H.

  2. A way of ultimately synthesising a utility function U_H that is an adequate encoding of the partial preferences of the human H.

  3. Practical methods for estimating this U_H, and how one could use the definition of U_H to improve other suggested methods for value alignment.

  4. Limitations and lacunas of the agenda: what is not covered. These may be avenues of future research, or issues that cannot fit into the paradigm.

There has been a myriad of small posts on this topic, and most will be referenced here. Most of these posts are stubs that hint at a solution, rather than spelling it out fully and rigorously.

The reason for that is to check for impossibility results ahead of time. The construction of U_H is deliberately designed to be adequate, rather than elegant (indeed, the search for an elegant U_H might be counterproductive and even dangerous, if genuine human preferences get sacrificed for elegance). If this approach is to work, then the safety of U_H has to be robust to different decisions in the synthesis process (see Section 2.8, on avoiding disasters). Thus, initially, it seems more important to find approximate ideas that cover all possibilities, rather than having a few fully detailed sub-possibilities and several gaps.

Finally, it seems that if a sub-problem is not formally solved, we stand a much better chance of getting a good result from "hit it with lots of machine learning and hope for the best" than we would if there were huge conceptual holes in the method—a conceptual hole meaning that the relevant solution is broken in an unfixable way. Thus, I'm publishing this agenda now, when I see many implementation holes, but no large conceptual holes.

A word of warning here, though: with some justification, the original Dartmouth AI conference could also have claimed to be confident that there were no large conceptual holes in their plan of developing AI over a summer—and we know how wrong they turned out to be. With that thought in mind, onwards with the research agenda.

0.1 Executive summary: synthesis process

The first idea of the project is to identify partial preferences as residing within human mental models. This requires identifying the actual and hypothetical internal variables of a human, and thus solving the "symbol grounding problem" for humans; ways of doing that are proposed.

The project then sorts the partial preferences into various categories of interest (basic preferences about the world, identity preferences, meta-preferences about basic preferences, global meta-preferences about the whole synthesis project, etc...). The aim is then to synthesise these into a single utility function U_H, representing the preferences of the human H (at a given time, or over a short interval of time). Different preference categories play different roles in this synthesis (eg object-level preferences get aggregated, meta-preferences can modify the weights of object-level preferences, global meta-preferences are used at the design stage, and so on).

The aims are to:

  1. Ensure the synthesis has good properties and reflects H's actual preferences, and not any of H's erroneous factual beliefs.

  2. Ensure that highly valued preferences weigh more than lightly held ones, even if the lightly held one is more "meta" than the other.

  3. Respect meta-preferences about the synthesis as much as possible, but...

  4. ...always ensure that the synthesis actually reaches a non-contradictory U_H.

To ensure points 2 and 4, there will always be an initial way of synthesising preferences, which certain meta-preferences can then modify in specific ways. This is designed to resolve contradictions (when "I want a simple moral system" and "value is fragile and needs to be preserved" are both comparably weighted meta-preferences) and remove preference loops ("I want a simple moral system" is itself simple and could reinforce itself; "I want complexity in my values" is also simple and could undermine itself).

The "good properties" of point 1 are established, in large part, by the global meta-preferences that don't comfortably sit within the synthesis framework. As for erroneous beliefs: if H wants to date a certain person because they think that would make them happy and respected, then an AI will synthesise "being happy" and "being respected" as preferences, and would push H away from dating that person if H were actually deluded about what dating them would accomplish.

That is the main theoretical contribution of the research agenda. It then examines what could be done with such a theory in practice, and whether the theory can be usefully approximated for constructing an actual utility function for an AI.

0.2 Executive summary: agenda difficulty and value

One early commentator on this agenda remarked:

[...] it seems like this agenda is trying to solve at least 5 major open problems in philosophy, to a level rigorous enough that we can specify them in code:

  1. The symbol grounding problem.

  2. Identifying what humans really care about (not just what they say they care about, or what they act like they care about) and what preferences and meta-preferences even are.

  3. Finding an acceptable way of making incomplete and inconsistent (meta-)preferences complete and consistent.

  4. Finding an acceptable way of aggregating many people's preferences into a single function[2].

  5. The nature of personal identity.

I agree that AI safety researchers should be more ambitious than most researchers, but this seems extremely ambitious, and I haven't seen you acknowledge the severe outside-view difficulty of this agenda.

This is indeed an extremely ambitious project. But, in a sense, a successful aligned AI project will ultimately have to solve all of these problems. Any situation in which most of the future trajectory of humanity is determined by AI is a situation where there are solutions to all of these problems.

Now, these solutions may be implicit rather than explicit; equivalently, we might be able to delay solving them via AI, for a while. For example, a tool AI solves these issues by being contained in such a way that human judgement is capable of ensuring good outcomes. Thus humans solve the grounding problem, and we design our questions to the AI to ensure compatibility with our preferences, and so on.

But as the power of AIs increases, humans will be confronted by situations they have never been in before, and our ability to solve these issues diminishes (and the probability increases that we might be manipulated or fall into a bad attractor). This transition may sneak up on us, so it is useful to start thinking about how to a) start solving these problems, and b) start identifying these problems crisply, so we can know when and whether they need to be solved, and when we are moving out of the range of validity of the "trust humans" solution. For both these reasons, all the issues will be listed explicitly in the research agenda.

A third reason to include them is so that we know what we need to solve those issues for. For example, it is easier to assess the quality of any solution to symbol grounding if we know what we're going to do with that solution. We don't need a full solution, just one good enough to define human partial preferences.

And, of course, we also need to consider scenarios where partial approaches like tool AI just don't work, or only work if we solve all the relevant issues anyway.

Finally, there is a converse: partial solutions to problems in this research agenda can contribute to improving other methods of AI alignment. Section 3 will look into this in more detail. The basic idea is that, to improve an algorithm or an approach, it is very useful to know what we are ultimately trying to do (eg compute partial preferences, or synthesise a utility function with certain acceptable properties). If we rely only on making local improvements, guided by intuition, we may ultimately get stuck when intuition runs out; and the improvements are more likely to be ad hoc patches than consistent, generalisable rules.

0.3 Executive aside: the value of approximating the theory

The theoretical construction of U_H in Sections 1 and 2 is a highly complicated object, involving millions of unobserved counterfactual partial preferences and a synthesis process involving higher-order meta-preferences. Section 3 touches on how U_H could be approximated, but, given its complexity, it would seem that the answer would be "only very badly".

And there is a certain sense in which this is correct. If U_H is the actual idealised utility defined by the process, and U'_H is the approximated utility that a real-world AI could compute, then it is likely[3] that U_H and U'_H will be quite different in many formal senses.

But there is a certain sense in which this is incorrect. Consider many of the AI failure scenarios: imagine, for example, that the AI extinguished all meaningful human interactions, because these can sometimes be painful and the AI knows that we prefer to avoid pain. But it's clear to us that most people's partial preferences would not endorse total loneliness as a good outcome; if it's clear to us, then it's a fortiori clear to a very intelligent AI; hence the AI will avoid that failure scenario.

One should be careful with using arguments of this type, but it is hard to see how there could be a failure mode that a) we would clearly understand is incompatible with a proper synthesis of U_H, but b) a smart AI would not. And it seems that any failure mode should be understandable to us, as a failure mode, especially given some of the innate conservatism of the construction of U_H.

Hence, even if U'_H is a poor approximation of U_H in a certain sense, it is likely an excellent approximation of U_H in the sense of avoiding terrible outcomes. So, though d(U_H, U'_H) might be large for some formal measure of distance d, a world where the AI maximises U'_H will be highly ranked according to U_H.
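
To make the two senses of "approximation" explicit (a minimal sketch, writing W for the set of worlds the AI could bring about, and using the sup norm as just one example of the formal distance d):

```latex
d(U_H, U'_H) \;=\; \sup_{w \in W} \bigl| U_H(w) - U'_H(w) \bigr| \quad \text{may be large, while}
\qquad
U_H\Bigl( \operatorname*{arg\,max}_{w \in W} U'_H(w) \Bigr) \;\approx\; \max_{w \in W} U_H(w).
```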

0.4 An inspiring just-so story

This is the story of how evolution created humans with preferences, and of what the nature of these preferences is. The story is not true, in the sense of being accurate; instead, it is intended to provide some inspiration as to the direction of this research agenda. This section can be skipped.

In the beginning, evolution created instinct-driven agents. These agents had no preferences or goals, nor did they need any. They were like Q-learning agents: they knew the correct action to take in different circumstances, but that was it. Consider baby turtles that walk towards the light upon birth, because, traditionally, the sea was lighter than the land—of course, this behaviour fails them in the era of artificial lighting.

But evolution has a tiny bandwidth, acting once per generation. So it created agents capable of planning, of figuring out different approaches, rather than having to follow instincts. This was useful, especially in varying environments, and so evolution offloaded a lot of its "job" onto the planning agents.

Of course, to be of any use, the planning agents needed to be able to model their environment to some extent (or else their plans couldn't work) and had to have preferences (or else every plan was as good as another). So, in creating the first planning agents, evolution created the first agents with preferences.

Of course, evolution is a messy, undirected process, so the process wasn't clean. Planning agents are still riven with instincts, and the modelling of the environment is situational, used when it is needed, rather than forming some consistent whole. Thus the "preferences" of these agents were underdefined and sometimes contradictory.

Finally, evolution created agents capable of self-modelling and of modelling other agents in their species. This might have been because of competitive social pressures, as agents learnt to lie and to detect lying. Of course, this being evolution, this self-and-other-modelling took the form of kludges built upon spandrels built upon kludges.

And then arrived humans, who developed norms and norm-violations. As a side effect of this, we started having higher-order preferences as to what norms and preferences should be. But instincts and contradictions remained—this is evolution, after all.

And evolution looked upon this hideous mess, and saw that it was good. Good for evolution, that is. But if we want it to be good for us, we're going to need to straighten out this mess somewhat.

1 The partial preferences of a human

The main aim of this research agenda is to start with a human H, at or around a given moment t, and produce a utility function U_H(t) which is an adequate synthesis of the human's preferences at the time t. Unless the dependence on t needs to be made explicit, this will simply be designated as U_H.

Later sections will focus on what can be done with U_H, or on the methods used for its construction; this section and the next will focus solely on that construction. It is mainly based on these posts, with some commentary and improvements.

Essentially, the process is to identify human preferences and meta-preferences within human (partial) mental models (Section 1), and find some good way of synthesising these into a whole (Section 2).

Partial preferences (see Section 1.1) will be decomposed into:

  1. Partial preferences about the world.

  2. Partial preferences about our own identity.

  3. Partial meta-preferences about our preferences.

  4. Partial meta-preferences about the synthesis process.

  5. Self-referential contradictory partial meta-preferences.

  6. Global meta-preferences about the outcome of the synthesis process.

This section and the next will lay out how preferences of types 1, 2, 3, and 4 can be used to synthesise the U_H. Section 2 will conclude by looking at what role preferences of type 6 can play. Preferences of type 5 are not dealt with in this agenda, and remain a perennial problem (see Section 4.5).

1.1 Partial models, partial preferences

As was shown in the paper "Occam's razor is insufficient to infer the preferences of irrational agents", an agent's behaviour is never enough to establish their preferences—even with simplicity priors or regularisation (see also this post and this one).

Therefore a definition of preference needs to be grounded in something other than behaviour. There are further arguments, presented here, as to why a theoretical grounding is needed even when practical methods are seemingly adequate; this point will be returned to later.

The first step is to define a partial preference (and a partial model for these to exist in). A partial preference is a preference that exists within a human being's internal mental model, and which contrasts two[4] situations along a single axis of variation, keeping other aspects constant. For example, "I wish I was rich (rather than poor)", "I don't want to go down that alley, lest I get mugged", and "this is much worse if there are witnesses around" are all partial preferences. A more formal definition of partial preferences, and of the partial mental model in which they exist, is presented here.
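
To make the object concrete, here is a minimal sketch of what a partial preference, as defined above, might look like as a data structure (the field names are hypothetical choices of mine, not part of the formal definition):

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class PartialPreference:
    """A preference expressed inside a human's partial mental model.

    It contrasts exactly two situations that differ along a single
    axis of variation, with all other modelled aspects held constant.
    """
    background: Dict[str, Any]   # aspects of the partial model held constant
    axis: str                    # the single axis of variation, eg "wealth"
    preferred_value: Any         # eg "rich"
    dispreferred_value: Any      # eg "poor"
    weight: float = 1.0          # intensity of the preference, relative to others

# Example: "I wish I was rich (rather than poor)"
wish_rich = PartialPreference(
    background={"who": "me", "everything_else": "as it currently is"},
    axis="wealth",
    preferred_value="rich",
    dispreferred_value="poor",
    weight=2.5,   # weights only matter relative to other partial preferences
)
```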

Note that this is one of the fundamental theoretical underpinnings of the method. It identifies human (partial) preferences as existing within human mental models. This is a "normative assumption": we choose to define these features as (partial) human preferences; the universe does not compel us to do so.

This definition gets around the "Occam's razor" impossibility result, since these mental models are features of the human brain's internal process, not of human behaviour. Conversely, this also violates certain versions of functionalism, precisely because the internal mental states are relevant.

A key feature is to extract not only the partial preference itself, but also the intensity of the preference, referred to as its weight. This will be key in combining the preferences together (technically, we only need the weight relative to other partial preferences).

1.2 Symbol grounding

In order to interpret what a partial model means, we need to solve the old problem of symbol grounding. "I wish I was rich" was presented as an example of a partial preference; but how can we identify "I", "rich", and the counterfactual "I wish", all within the mess of the neural net that is the human brain?

To ground these symbols, we should approach the issue of symbol grounding empirically, by aiming to predict the values of real-world variables through knowledge of internal mental variables (see also the example presented here). This empirical approach can provide sufficient grounding for the purposes of partial models, even if symbol grounding is not solved in the traditional linguistic sense of the problem.

This is because each symbol has a web of connotations: a collection of other symbols and concepts that co-vary with it in normal human experience. Since the partial models are generally defined to be within normal human experiences, there is little difference between any symbols that are strongly correlated.

To formalise and improve this definition, we'll have to be careful about how we define the internal variables in the first place—overly complicated or specific internal variables could be chosen to correlate artificially well with external variables. This is, essentially, "symbol grounding overfitting".
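
A minimal sketch of this empirical grounding test, assuming we already had paired recordings of internal mental variables and external world variables (the data and variable names below are purely hypothetical); the regularisation and held-out evaluation are there precisely to penalise "symbol grounding overfitting":

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: rows are moments in time.
# internal_vars: activations of candidate internal symbols (eg a putative "richness" symbol).
# external_var:  the real-world variable we think the symbol is about (eg measured wealth).
internal_vars = rng.normal(size=(500, 50))
external_var = internal_vars[:, 3] * 2.0 + rng.normal(scale=0.5, size=500)

# Sparse, regularised model: a symbol "grounds" an external variable if a *simple*
# function of internal variables predicts it out of sample, not just in sample.
model = LassoCV(cv=5).fit(internal_vars, external_var)
held_out_r2 = cross_val_score(model, internal_vars, external_var, cv=5).mean()

print("candidate grounding strength (held-out R^2):", round(held_out_r2, 3))
print("internal variables actually used:", np.flatnonzero(model.coef_))
```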

Another consideration is the extent to which the model is conscious or subconscious; aliefs, for example, could be modelled as subconscious partial preferences. For consciously endorsed aliefs, this is not much of a problem—we instinctively fear touching fires, and don't desire to lose that fear. But if we don't endorse that alief—for example, we might fear flying and not want to fear it—this becomes more tricky. Things get confusing with partially endorsed aliefs: amusement park rides are extremely safe, and we wouldn't want to be crippled with fear at the thought of going on one. But neither would we want the experience to feel perfectly bland and safe.

1.3 Which (real and hypothetical) partial models?

Another important consideration is that humans do not have, at the moment t, a complete set of partial models and partial preferences. They may have a single partial model in mind, with maybe a few others in the background—or they might not be thinking about anything like this at all. We could extend the parameters to some short period around the time t (reasoning that people's preferences rarely change in such a short time), but though that gives us more data, it doesn't give us nearly enough.

The most obvious way to get a human to produce an internal model is to ask them a relevant question. But we have to be careful about this—since human values are changeable and manipulable, the very act of asking a question can cause humans to think in certain directions, and even create partial preferences where none existed. The more interaction between the questioner and the human, the more extreme the preferences that can be created. If the questioner is motivated to maximise the utility function that it is also computing (ie if the U_H is constructed by an online learning process), then the questioner can rig or influence the learning process.

Fortunately, there are ways of removing the questioner's incentives to rig or influence the learning process.

Thus the basic human preferences at time t are defined to be those partial models produced by "one-step hypotheticals"[5]. These are questions that do not cause the human to be put in unusual mental situations, and that try to minimise any departure from the human's base state.

Some preferences are conditional (eg "I want to eat something different from what I've eaten so far this week"), as are some meta-preferences (eg "If I hear a convincing argument about X being good, I want to prefer X"), which could violate the point of the one-step hypothetical. Thus conditional (meta-)preferences are only acceptable if their conditions are achieved by short streams of data, unlikely to manipulate the human. They should also be weighted more if they fit a consistent narrative of what the human is/wants to be, rather than being ad hoc (this will be assessed by machine learning, see Section 2.4).

Note that among the one-step hypotheticals are included questions about rather extreme situations—heaven and hell, what to do if plants were conscious, and so on. In general, we should reduce the weight[6] of partial preferences in extreme situations[7]. This is because of the unfamiliarity of these situations, and because the usual human web of connotations between concepts may have broken down (if a plant was conscious, would it be a plant in the sense we understand that?). Sometimes the breakdown is so extreme that we can say that the partial preference is factually wrong. This includes effects like the hedonic treadmill: our partial models of achieving certain goals often include an imagined long-term satisfaction that we would not actually feel. Indeed, it might be good to specifically avoid these extreme situations, rather than having to make a moral compromise that might lose part of H's values due to uncertainty. In that case, ambiguous extreme situations get a slight intrinsic negative—one that might be overcome by other considerations, but is there nonetheless.
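
One crude way to picture both effects, with purely illustrative symbols of my own (x_P measuring how far the partial model behind a preference P sits from ordinary human experience):

```latex
w'_P = w_P \, e^{-\lambda x_P},
\qquad
U_H(s) \;\mapsto\; U_H(s) - \mu \,\mathbf{1}\!\left[\, s \text{ is an ambiguous extreme situation} \,\right],
```

where lambda controls how quickly weights decay with extremeness, and the small penalty mu is the "slight intrinsic negative" attached to ambiguous extreme situations.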

A final consideration is that some concepts just disintegrate in general environments—for example, consider a preference for "natural" or "hand-made" products. In those cases, the web of connotations can be used to extract some preferences in general—for example, "natural", used in this way, has connotations[8] of "healthy", "traditional", and "non-polluting", all of which extend better to general environments than "natural" does. Sometimes the preference can be preserved but routed around: some versions of "no artificial genetic modifications" could be satisfied by selective breeding that achieved the same result. And some versions couldn't; it's all a function of what powers the underlying preference: specific techniques, or a general wariness of these types of optimisation. Meta-preferences might be very relevant here.

2 Synthesising the preference utility function

Here we will sketch out the construction of the human utility function U_H, from the data that is the partial preferences and their (relative) weights.

This is not, by any means, the only way of constructing U_H. But it is illustrative of how the utility could be constructed, and can be more usefully critiqued and analysed than a vaguer description.

2.1 What sort of utility function?

Partial preferences are defined over states of the world or states of the human H. The latter include both things like "being satisfied with life" (purely internal) and "being an honourable friend" (mostly about H's behaviour).

Consequently, U_H must also be defined over such things, so it is dependent on states of the world and states of the human H. Unlike standard MDP-like situations, these states can include the history of the world, or of H, up to that point—preferences like "don't speak ill of the dead" abound in humans.

2.2 Why a utility function?

Why should we aim to synthesise a utility function, when human preferences are very far from being utility functions?

It's not out of an innate admiration for utility functions, or a desire for mathematical elegance. It's because they tend to be stable under self-modification. Or, to be more accurate, they seem to be much more stable than preferences that are not utility functions.

In the imminent future, human preferences are likely to become stable and unchanging. Therefore it makes more sense to create a preference synthesis that is already stable, than to create a potentially unstable one and let it randomly walk itself to stability (though see Section 4.6).

Also, and this is one of the motivations behind classical inverse reinforcement learning, reward/utility functions tend to be quite portable, and can be moved from one agent to another, or from one situation to another, with greater ease than other goal structures.

2.3 Extending and normalising partial preferences

Human values are changeable, manipulable, underdefined, and contradictory. By focusing around the time t, we have removed the changeability problem for partial preferences (see this post for thoughts on how long a period around t should be allowed); manipulability has been dealt with by removing the possibility of the AI influencing the learning process.

Being underdefined remains a problem, though. It would be possible to overfit absurdly specifically to the human's partial models, and generate a U_H that is in full agreement with H's partial preferences yet utterly useless. So the first thing to do is to group the partial preferences together according to similarity (for example, preferences about concepts closely related in terms of webs of connotations should generally be grouped together), and generalise them in some regularised way. "Generalise" means, here, that they are transformed into full preferences, comparing all possible universes—though only comparing them on the narrow criteria that were used for the partial preference: a partial preference about the fear of being mugged could generalise to a fear of pain/violence/violation/theft across all universes, but would not include other aspects of our preferences. So they are full preferences, in terms of applying to all situations, but not the full set of our preferences, in terms of taking into account all our partial preferences.

It seems that standard machine learning techniques should already be up to the task of making full preferences from collections of partial preferences (with all the usual current problems). For example, clustering of similar preferences would be necessary. There are unsupervised ML algorithms that can do that; but even supervised ML algorithms end up grouping labelled data together in ways that define extensions of the labels into higher-dimensional space. Where could these labels come from? Well, they could come from grounded symbols within meta-preferences. A meta-preference of the form "I would like to be free of bias" contains some model of what "bias" is; if that meta-preference is particularly weighty, then clustering preferences by whether or not they are biases could be a good thing to do.
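
A minimal sketch of the clustering step, assuming we already had some vector representation of each generalised partial preference (the embeddings and the number of clusters below are placeholders of mine, not part of the agenda):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical embeddings: one row per partial preference, built (somehow) from
# its grounded symbols and web of connotations.
preference_embeddings = rng.normal(size=(200, 32))
weights = rng.uniform(0.1, 3.0, size=200)   # relative weights of the partial preferences

# Group similar partial preferences; each cluster is a candidate "full preference".
clusters = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(preference_embeddings)

# The weight of each candidate full preference is the total weight of its members.
cluster_weights = {c: weights[clusters == c].sum() for c in np.unique(clusters)}
print(sorted(cluster_weights.items(), key=lambda kv: -kv[1])[:3])
```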

Once the partial preferences are generalised in this way, there remains the problem of them being contradictory. This is not as big a problem as it may seem. First of all, it is very rare for preferences to be utterly opposed: there is almost always some compromise available. So an altruist with murderous tendencies could combine charity work with aggressive online gaming; indeed, some whole communities (such as BDSM) are designed to balance "opposing" desires for risk and safety.

So, in general, the way to deal with contradictory preferences is to weight them appropriately, then add them together; any compromise will then appear naturally from the weighted sum[9].

To do that, we need to normalise the preferences in some way. We might seek to do this in an a priori, principled way, or through partial models that include the tradeoffs between different preferences. Preferences that pertain to extreme situations, far removed from everyday human situations, could also be penalised in this weighting process (as the human should be less certain about these).
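
In its simplest, purely illustrative form, writing \hat{P}_j for the generalised full preferences over states s and w_j for their weights, the weighted-sum step is just:

```latex
U_H(s) \;=\; \sum_j w_j \, \hat{P}_j(s),
```

with each \hat{P}_j rescaled (for example, to unit spread over the relevant states) so that the weights w_j are actually comparable.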

Now that the partial preferences have been identified and weighted, the challenge is to synthesise them into a single U_H.

2.4 Synthesising the preference function: first step

So this is how one could do the first step of preference synthesis:

  1. Group similar partial preferences together, and generalise them to full preferences without overfitting.

  2. Use partial models to compute the relative weights of different partial preferences.

  3. Using those relative weights, and again without overfitting, synthesise those preferences into a single utility function U_H.

This all seems doable in theory within standard machine learning. See Section 2.3 and the discussion of clustering for point 1. Point 2 comes from the definition of partial preferences. And point 3 is just an issue of fitting a good regularised approximation to noisy data.

In a certain sense, this process is the partial opposite of how Jacob Falkovich used a spreadsheet to find a life partner. In that process, he started by factoring his goal of having a life partner into many different subgoals. He then ranked the putative partners on each of the subgoals by comparing two options at a time, and building a (cardinal) ranking from these comparisons. The process here also aims to assign cardinal values from comparisons of two options, but the construction of the "subgoals" (full preferences) is handled by machine learning from the sets of weighted comparisons.
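
As an illustration of the "cardinal values from pairwise comparisons" step only (this is one standard least-squares trick, not the agenda's prescribed method): given comparisons of the form "option i beats option j by margin m", we can solve for per-option scores.

```python
import numpy as np

# Hypothetical pairwise data within one full preference:
# (i, j, margin) means option i is preferred to option j with the given strength.
comparisons = [(0, 1, 1.0), (1, 2, 0.5), (0, 2, 2.0), (2, 3, 1.5)]
n_options = 4

# Least squares: find scores s such that s[i] - s[j] ~= margin for each comparison.
A = np.zeros((len(comparisons) + 1, n_options))
b = np.zeros(len(comparisons) + 1)
for row, (i, j, margin) in enumerate(comparisons):
    A[row, i], A[row, j], b[row] = 1.0, -1.0, margin
A[-1, :] = 1.0   # pin the overall level: scores sum to zero (only differences matter)

scores, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(scores, 2))   # cardinal values consistent (in the least-squares sense) with the comparisons
```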

2.5 Identity preferences

Some preferences are best understood as pertaining to our own identity. For example, I want to understand how black holes work; this is separate from my other preference that some humans understand black holes (and separate again from an instrumental preference that, if we had a convenient black hole close to hand, we could use it to extract energy).

Identity preferences seem to be different from preferences about the world; they seem more fragile than other preferences. We could combine identity preferences differently from standard preferences, for example using smoothmin rather than summation.

Ultimately, the human's mental exchange rate between preferences should determine how preferences are combined. This should allow us to treat identity preferences and world-preferences in the same way. There are two reasons to still distinguish between world-preferences and identity preferences:

  1. For preferences where relative weights are unknown or ill-defined, linear combinations and smoothmin serve as good defaults for world-preferences and identity preferences respectively (see the sketch after this list).

  2. It's not certain that identity can be fully captured by partial preferences; in that case, identity preferences could serve as a starting point from which to build a concept of human identity.
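
A minimal sketch of the two default aggregation rules mentioned in point 1 (the smoothmin here is the standard "soft minimum"; the sharpness parameter is an illustrative choice):

```python
import numpy as np

def linear_combination(values, weights):
    """Default aggregation for world-preferences: a weighted sum."""
    return float(np.dot(weights, values))

def smoothmin(values, weights, sharpness=5.0):
    """Default aggregation for identity preferences: a soft minimum.

    As sharpness grows this approaches min(values); unlike a sum, it cannot
    trade a catastrophic loss on one identity preference for gains elsewhere.
    """
    v = np.asarray(values, dtype=float) * np.asarray(weights, dtype=float)
    return float(-np.log(np.sum(np.exp(-sharpness * v))) / sharpness)

satisfactions = [0.9, 0.8, 0.1]   # how well each preference is satisfied in some world
weights = [1.0, 1.0, 1.0]
print(linear_combination(satisfactions, weights))  # happily trades off the 0.1
print(smoothmin(satisfactions, weights))           # dominated by the worst-satisfied preference
```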

2.6 Synthesising the preference function: meta-preferences

Humans generally have meta-preferences: preferences over the kind of preferences they should have (often phrased as preferences over their identity, eg "I want to be more generous", or "I want to have consistent preferences").

This is such an important feature of humans that it needs its own treatment; this post first looked into that.

Standard meta-preferences endorse or unendorse lower-level preferences. First, one can combine them as in the method above, and get a synthesised meta-preference. This then increases or decreases the weights of the lower-level preferences, to reach a U_H with preference weights adjusted by the synthesised meta-preferences.

Note that this requires some ordering of the meta-preferences: each meta-preference refers only to (meta-)preferences "below" itself. Self-referential meta-preferences (or, equivalently, meta-preferences referring to each other in a cycle) are more subtle to deal with; see Section 4.5.

Note that an ordering does not mean that the higher meta-preferences must dominate the lower ones; a weakly held meta-preference (eg a vague desire to fit in with some formal standard of behaviour) need not overrule a strongly held object-level preference (eg a strong love for a particular person, or empathy for an enemy).
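
A minimal sketch of this endorsement mechanism (the multiplicative adjustment and the specific numbers are illustrative choices of mine, not the agenda's): meta-preferences are processed level by level, each one nudging the weights of the preferences below it in proportion to its own weight.

```python
# Object-level preferences and their weights (from the synthesis in Section 2.4).
object_weights = {"be generous": 1.0, "hoard money": 0.8, "love of family": 2.0}

# Meta-preferences: own weight, plus endorsement factors for the targets they refer to.
# A factor > 1 endorses the target preference; a factor < 1 unendorses it.
meta_preferences = [
    {"weight": 0.5, "endorsements": {"be generous": 1.5, "hoard money": 0.5}},
    {"weight": 0.2, "endorsements": {"love of family": 1.2}},
]

adjusted = dict(object_weights)
for meta in meta_preferences:
    for target, factor in meta["endorsements"].items():
        # A weakly weighted meta-preference only nudges the target weight a little,
        # so it cannot overrule a strongly held object-level preference.
        adjusted[target] *= factor ** meta["weight"]

print({k: round(v, 3) for k, v in adjusted.items()})
```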

2.7 Synthesising the preference function: meta-preferences about synthesis

In a special category are the meta-preferences about the synthesis process itself. For example, philosophers might want to give greater weight to higher-order meta-preferences, or might value the simplicity of the whole U_H.

One can deal with that by using the standard synthesis (of Section 2.4) to combine these meta-preferences about the method, and then using this combination to change how the standard preferences are synthesised. This old post has some examples of how this could be achieved.

As long as there is an ordering of meta-preferences about synthesis, one can use the standard method to synthesise the highest level of meta-preferences, which then tells us how to synthesise the lower-level meta-preferences about synthesis, and so on.

Why use the standard synthesis method for these meta-preferences—especially if they contradict this synthesis method explicitly? There are three reasons for this:

  1. These meta-preferences may be weakly weighted (hence weakly held), so they should not automatically overwhelm the standard synthesis process when applied to themselves (think of continuity as the weight of the meta-preference fades to zero).

  2. Letting meta-preferences about synthesis determine how they themselves get synthesised leads to circular meta-preferences, which may cause problems (see Section 4.5).

  3. The standard method is more predictable, which makes the whole process more predictable; self-reference, even if resolved, could lead to outcomes randomly far away from the intended one. Predictability could be especially important for the "meta-preferences over outcomes" of the next section.

Note that these synthesis meta-preferences should be of a type that affects the synthesis of U_H, not its final form. So, for example, "simple (meta-)preferences should be given extra weight in U_H" is valid, while "U_H should be simple" is not.

Thus, finally, we can combine everything (except for some self-referencing contradictory preferences) into one U_H.

Note that there are many degrees of freedom in how the synthesis could be carried out; it's hoped that they don't matter much, and that each of them will reach a U_H that avoids disasters[10] (see Section 2.8).

2.8 Avoiding disasters, and global meta-preferences

It is important that we don't end up in some disastrous outcome; the very definition of a good human value theory requires this.

The approach has some in-built protection against many types of disaster. Part of that is that it can include very general and universal partial preferences, so any combination of "local" partial preferences must be compatible with these. For example, we might have a collection of preferences about autonomy, pain, and personal growth. It's possible that, when synthesising these preferences together, we could end up with some "kill everyone" preference, due to bad extrapolation. However, if we have a strong "don't kill everyone" preference, this will push the synthesis process away from that outcome.

So some disastrous outcomes of the synthesis should be avoided, precisely because all of H's preferences are used, including those that would specifically label that outcome a disaster.

But, even if we included all of H's preferences in the synthesis, we'd still want to be sure we'd avoided disasters.

In one sense, this requirement is trivially true and useful. But in another, it seems perverse and worrying—the U_H is supposed to be a synthesis of true human preferences. By definition. So how could this be, in any sense, a disaster? Or a failure? What criteria—apart from our own preferences—could we use? And shouldn't we be using these preferences in the synthesis itself?

The reason that we can talk about U_H not being a disaster is that not all our preferences can best be captured in the partial model formalism above. Suppose one fears a siren world, or reassures oneself that we can never encounter an indescribable hellworld. Both of these could be clunkily transformed into standard meta-preferences (maybe about what some devil's advocate AI could tell us?). But that somewhat misses the point. These top-meta-level considerations live most naturally at the top meta-level: reducing them to the standard format of other preferences and meta-preferences risks losing the point. Especially when we only partially understand these issues, translating them to standard meta-preferences risks losing the understanding we do have.

So it remains possible to say that U_H is "good" or "bad", using higher-level considerations that are difficult to capture entirely within U_H.

For example, there is an argument that human preference incoherence should not cost us much. If true, this argument suggests that overfitting to the details of human preferences is not as bad as we might fear. One could phrase this as a synthesis meta-preference allowing more overfitting, but this doesn't capture a coherent meaning of "not as bad"—and so misses the real point of this argument, which is "allow more overfitting if the argument holds". To use that, we need some criteria for establishing that "the argument holds". This seems very hard to do within the synthesis process, but could be attempted as top-level meta-preferences.

We should be cautious and selective when using these top-level preferences in this way. This is not generally the point at which we should be adding preferences to U_H; that should be done when constructing U_H. Still, if we have a small selection of criteria, we could formalise these and check ourselves whether U_H satisfies them, or have an AI do so while synthesising U_H. A Last Judge can be a sensible precaution (especially if there are more downsides to error than upsides to perfection).

Note that we need to distinguish between the global meta-preferences of the designers (us) and those of the subject H. So, when designing the synthesis process, we should either allow options to be automatically changed by H's global preferences, or be aware that we are overriding them with our own judgement (which may be inevitable, as most H's will not have thought deeply about preference synthesis; still, it is good to be aware of this issue).

This is also the level at which experimental testing of the synthesis is likely to be useful—keeping in mind what we expect from the synthesis, and running it in some complicated toy environments, we can see whether our expectations are correct. We may even discover extra top-level desiderata this way.

2.9 How much to delegate to the process

The method has two types of basic preferences (world-preferences and identity preferences). This is a somewhat useful division; but there are others that could have been used. Altruistic versus selfish versus anti-altruistic preferences is a division that was not used (though see Section 4.3). Moral preferences were not directly distinguished from non-moral preferences (though some human meta-preferences might make the distinction).

So, why divide preferences this way, rather than in some other way? The aim is to allow the process itself to take into account most of the divisions that we might care about; things that go into the model explicitly are structural assumptions that are of vital importance. So the division between world- and identity preferences was chosen because it seemed absolutely crucial to get that right (and to err on the side of caution in distinguishing the two, even if our own preferences don't distinguish them as much). Similarly, the whole idea of meta-preferences seems a crucial feature of humans, which might not be relevant for general agents, so it was important to capture it. Note that meta-preferences are treated as a different type to standard preferences, with different rules; most distinctions built into the synthesis method should similarly be between objects of a different type.

But this is not set in stone; global meta-preferences (see Section 2.8) could be used to justify a different division of preference types (and different methods of synthesis). But it's important to keep in mind what assumptions are being imposed from outside the process, and what the method is allowed to learn during the process.

3 U_H in practice

3.1 Synthesis of U_H in practice

If the definition of U_H of the previous sections could be made fully rigorous, and if the AI had a perfect model of H's brain, knowledge of the universe, and unlimited computing power, it could construct U_H perfectly and directly. This will almost certainly not be the case; so, do all these definitions give us something useful to work with?

It seems they do. Even extreme definitions can be approximated, hopefully to some good extent (and the theory allows us to assess the quality of the approximation, as opposed to another method without theory, where there is no meaningful measure of approximation ability). See Section 0.3 for an argument as to why even very approximate versions of U_H could result in very positive outcomes: even an approximated U_H rules out most bad AI failure scenarios.

In practical terms, the synthesis of U_H from partial preferences seems quite robust and doable; it's the definition of these partial preferences that seems tricky. One might be able to directly see the internal symbols in the human brain, with some future super-version of fMRI. Even without that direct input, having a theory of what we are looking for—partial preferences in partial models, with human symbols grounded—allows us to use results from standard and moral psychology. These results are insights into behaviour, but they are often also, at least in part, insights into how the human brain processes information. In Section 3.3, we'll see how the definition of U_H allows us to "patch" other, more classical methods of value alignment. But the converse is also true: with a good theory, we can use more classical methods to figure out U_H. For example, if we see H as being in a situation where they are likely to tell the truth about their internal model, then their stated preferences become good proxies for their internal partial preferences.

If we have a good theory of how human preferences change over time, then we can use preferences expressed at other times as evidence for the hypothetical preferences at time t. In general, more practical knowledge and understanding would lead to a better understanding of the partial preferences and how they change over time.

This could become an area of interesting research; once we have a good theory, it seems there are many different practical methods that suddenly become usable.

For example, it seems that humans model themselves and each other using very similar methods. This allows us to use our own judgement of irrationality and intentionality, to some extent, and in a principled way, to assess the internal models of other humans. As we shall see in Section 3.3, an awareness of what we are doing—using the similarity between our internal models and those of others—also allows us to assess when this method stops working, and patch it in a principled way.

In general, this sort of research would give results of the type "assuming this connection between empirical facts and internal models (an assumption with some evidence behind it), we can use this data to estimate internal models".

3.2 (Avoiding) uncertainty and manipulative learning

There are arguments that, as long as we account properly for our uncertainty and fuzziness, there are no Goodhart-style problems in maximising an approximation to U_H. This argument has been disputed, and there are ongoing debates about it.

With a good definition of what it means for the AI to influence the learning process, online learning of U_H becomes possible, even for powerful AIs learning over long periods of time in which the human changes their views (either naturally or as a consequence of the AI's actions).

Thus we could construct an online version of inverse reinforcement learning without assuming rationality, where the AI learns about partial models and human behaviour simultaneously, constructing the U_H from observations, given the right data and the right assumptions.

3.3 Principled patching of other methods

Some of the theoretical ideas presented here can be used to improve other AI alignment ideas. This post explains one of the ways this can happen.

The basic idea is that there exist methods—stated preferences, revealed preferences, an idealised human reflecting for a very long time—that are often correlated with U_H and with each other. However, all of these methods fail: stated preferences are often dishonest (the revelation principle doesn't apply in the social world), revealed preferences assume a rationality that is often absent in humans (and some models of revealed preferences obscure how unrealistic this rationality assumption is), and humans that think for a long time have the possibility of value drift or of random walks to convergence.

Given these flaws, it is always tempting to patch the method: add caveats to get around the specific problem encountered. However, if we patch and patch until we can no longer think of any further problems, that doesn't mean there are no further problems: simply that they are likely beyond our capacity to predict ahead of time. And, if all that it has is a list of patches, the AI is unlikely to be able to deal with these new problems.

However, if we keep the definition of U_H in mind, we can come up with principled reasons to patch a method. For example, lying about stated preferences means a divergence between stated preferences and the internal model; revealed preferences only reveal within the parameters of the partial model that is being used; and value drift is a failure of preference synthesis.

Therefore, each patch can come with an explanation for the divergence between the method and the desired outcome. So, when the AI develops the method further, it can itself patch the method when it enters a situation where a similar type of divergence occurs. It has a reason for why these patches exist, and hence the ability to generate new patches efficiently.

3.4 Simplified U_H sufficient for many methods

It's been argued that many different methods rely upon, if not a complete synthesis U_H, at least some simplified version of it. Corrigibility, low impact, and distillation/amplification all seem to be methods that require some simplified version of U_H.

Similarly, some concepts that we might want to use or avoid—such as "manipulation" or "understanding the answer"—may also require a simplified utility function. If these concepts can be defined, then one can disentangle them from the rest of the alignment problem, allowing us to instructively consider situations where the concept makes sense.

In that case, a simplified or incomplete construction of U_H, using some simplification of the synthesis process, might be sufficient for one of the methods or definitions just listed.

3.5 Applying the intuitions behind U_H to analysing other situations

Finally, one could use the definition of U_H as inspiration when analysing other methods, which could lead to interesting insights. See for example these posts on figuring out the goals of a hierarchical system.

4 Limits of the method

This section will look at some of the limitations and lacunas of the method described above. For some limitations, it will suggest possible ways of dealing with them; but these are, deliberately, chosen to be extras beyond the scope of the method, where synthesising U_H is the whole goal.

4.1 Utility at one point in time

The U_H is meant to be a synthesis of the current preferences and meta-preferences of the human H, using one-step hypotheticals to fill out the definition. Human preferences are changeable on a short time scale, without us feeling that we become a different person. Hence it may make sense to replace U_H(t) with some average of U_H, averaged over a short (or longer) period of time. Shorter periods lead to more "overfitting" to momentary urges; longer periods allow more manipulation or drift.
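
For example (this particular averaging being only the simplest option), over a window [t_0, t_1] one could take:

```latex
\overline{U}_H \;=\; \frac{1}{t_1 - t_0} \int_{t_0}^{t_1} U_H(t) \, dt .
```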

4.2 Not a philosophical ideal

The U_H is also not a reflective equilibrium or other idealised distillation of what preferences should be. Philosophers will tend to have a more idealised U_H, as will those who have reflected a lot and are more willing to be bullet swallowers/bullet biters. But that is because these people have strong meta-preferences that push in those idealised directions, so any honest synthesis of their preferences must reflect these.

Similarly, this U_H is defined to be the preferences of some human H. If that human is bigoted or selfish, their U_H will be bigoted or selfish. In contrast, moral preferences that can be considered factually wrong will be filtered out by this construction. Similarly, preferences based on erroneous factual beliefs ("trees can think, so...") will be removed or qualified ("if trees could think, then...").

Thus if H is wrong, the U_H will not reflect that wrongness; but if H is evil, then U_H will reflect that evilness.

Also, the procedure will not distinguish between moral preferences and other types of preferences, unless the human themselves does.

4.3 Individual utility versus common utility

This research agenda will not look into how to combine the U_H of different humans. One could simply weight the utilities according to some semi-plausible scale and add them together.

But we could do many other things as well. I've suggested removing anti-altruistic preferences before combining the U_H's into some global utility function for all of humanity—or for all future and current sentient beings, or for all beings that could suffer, or for all physical entities.

There are strong game-theoretical reasons to remove anti-altruistic preferences. We might also add philosophical considerations (eg moral realism) or deontological rules (eg human rights, restrictions on copying themselves, extra weighting for certain types of preferences), either to the individual U_H's or when combining them, or prioritise moral preferences over other types. We might want to preserve the capacity for moral growth, somehow (see Section 4.6).

That can all be done, but it is not part of this research agenda, whose sole purpose is to synthesise the individual U_H's, which can then be used for other purposes.

4.4 Synthesising U_H rather than discovering it (moral anti-realism)

The utility U_H will be constructed, rather than deduced or discovered. Some moral theories (such as some versions of moral realism) posit that there is a (generally unique) U_H waiting to be discovered. But none of these theories give effective methods for doing so.

In the absence of such a definition of how to discover an ideal U_H, it would be highly dangerous to assume that finding U_H is a process of discovery. Thus the whole method is constructive from the very beginning (and based on a small number of arbitrary choices).

Some versions of moral realism could make use of U_H as a starting point for their own definitions. Indeed, in practice, moral realism and moral anti-realism seem to be initially almost identical when meta-preferences are taken into account. Moral realists often have mental examples of what counts as "moral realism doesn't work", while moral anti-realists still want to simplify and organise moral intuitions. To a first approximation, these approaches can be very similar in practice.

4.5 Self-refer­en­tial con­tra­dic­tory preferences

There re­main prob­lems with self-refer­en­tial prefer­ences—prefer­ences that claim they should be given more (or less) weight than oth­er­wise (eg “all sim­ple meta-prefer­ences should be pe­nal­ised”). This was already ob­served in a pre­vi­ous post.

This in­cludes for­mal Gödel-style prob­lems, with prefer­ences ex­plic­itly con­tra­dict­ing them­selves, but those seem solv­able—with one or an­other ver­sion of log­i­cal un­cer­tainty.

More wor­ry­ing, from the prac­ti­cal stand­point, is the hu­man ten­dency to re­ject val­ues im­posed upon them, just be­cause they are im­posed upon them. This re­sem­bles a prefer­ence of the type “re­ject any com­puted by any syn­the­sis pro­cess”. This prefer­ence is weakly ex­is­tent in al­most all of us, and a va­ri­ety of our other prefer­ences should pre­vent the AI from forcibly re-writ­ing us to be­come -de­siring agents.

So it re­mains not at all clear what hap­pens when the AI says “this is what you re­ally pre­fer” and we al­most in­evitably an­swer “no!”

Of course, since U_H is constructed rather than discovered, there is some latitude. It might be possible to involve the human in the construction process, in a way that increases their buy-in (thanks to Tim Genewein for the suggestion). Maybe the AI could construct a first U_H, and refine it through further interactions with the human. And maybe, in that situation, if we are confident that U_H is pretty safe, we’d want the AI to subtly manipulate the human’s preferences towards it.
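
A minimal sketch of such an interaction loop, assuming hypothetical `synthesise`, `present` and `revise` operations that the agenda itself does not define:

```python
def refine_with_human(human, synthesise, present, revise, max_rounds=10):
    """Iteratively construct U_H, showing each draft to the human and folding
    their objections back into the next draft.

    The three callables are placeholders for machinery the agenda leaves open:
      synthesise(human)      -> a first draft of U_H
      present(human, u)      -> the human's objections to draft u (possibly empty)
      revise(u, objections)  -> a new draft incorporating those objections
    """
    u = synthesise(human)
    for _ in range(max_rounds):
        objections = present(human, u)
        if not objections:   # the human endorses this draft
            break
        u = revise(u, objections)
    return u
```

Whether a loop like this produces genuine buy-in, or merely drafts that the human has stopped objecting to, is exactly the open question of this subsection.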

4.6 The ques­tion of iden­tity and change

It’s not certain that human concepts of identity can be fully captured by identity preferences and meta-preferences. If they cannot, it is important that human identity be figured out somehow, lest humanity itself vanish even as our preferences are satisfied. Nick Bostrom sketched how this might happen: in the “mindless outsourcers” scenario, humans outsource more and more of their key cognitive features to automated algorithms, until nothing remains of “them” any more.

Some­what re­lated is the fact that many hu­mans see change and per­sonal or moral growth as a key part of their iden­tity. Can such a de­sire be ac­com­mo­dated, de­spite a likely sta­bil­i­sa­tion of val­ues, with­out just be­com­ing a ran­dom walk across prefer­ence space?

Some aspects of growth and change can be accommodated. Humans can certainly become more skilled, more powerful, and more knowledgeable. Since humans don’t distinguish well between terminal and instrumental goals, some forms of factual learning resemble moral learning (“if it turns out that anarchism results in the greatest flourishing of humanity, then I wish to be an anarchist; if not, then not”). If we take into account the preferences of all humans in some roughly equal way (see Section 4.3), then we can get “moral progress” without needing to change anyone’s individual preferences. Finally, professional roles, contracts, and alliances allow for behavioural changes (and sometimes value changes), in ways that maximise the initial values. Sort of like “if I do PR work for the Anarchist party, I will spout anarchist values” and “I agree to make my values more anarchist, in exchange for the Anarchist party shifting their values more towards mine”.
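
The anarchism example can be read as a conditional preference: the terminal goal (human flourishing) stays fixed, and only the instrumental commitment changes as factual beliefs update. A toy illustration (the threshold and the probability inputs are arbitrary choices for the example):

```python
def preferred_politics(prob_anarchism_maximises_flourishing: float) -> str:
    """Conditional preference: 'if anarchism results in the greatest flourishing
    of humanity, I wish to be an anarchist; if not, then not.'  The terminal goal
    (flourishing) never changes; only the instrumental commitment does, as
    factual beliefs about the world are updated."""
    return "anarchist" if prob_anarchism_maximises_flourishing > 0.5 else "not an anarchist"

# Factual learning that looks, from the outside, like moral learning:
print(preferred_politics(0.2))  # -> not an anarchist
print(preferred_politics(0.9))  # -> anarchist
```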

Beyond these examples, it gets trickier to preserve moral change. We might put in a slider that makes our values less instrumental or less selfish over time, but that feels like a cheat: we already know what we will become, and are just taking the long route to get there. Otherwise, we might allow our values to change within certain defined areas. This would have to be carefully specified to prevent random change, but the main challenge is efficiency: changing values carries an inevitable efficiency cost, so there needs to be strong positive pressure to preserve the changes: not just an unused “possibility for change”, but actual, efficiency-losing changes.
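
One crude way of picturing that trade-off, with the allowed region and the penalty rate both being assumptions of the sketch: values may only drift inside a defined region, and any drift that actually happens incurs an explicit efficiency cost.

```python
from typing import Callable, List, Tuple

def drift_values(values: List[float],
                 proposed_change: List[float],
                 allowed_region: Callable[[List[float]], bool],
                 penalty_rate: float = 0.1) -> Tuple[List[float], float]:
    """Apply a proposed change to a vector of values only if the result stays
    inside the allowed region; return the new values together with the
    efficiency cost actually paid for the change."""
    candidate = [v + d for v, d in zip(values, proposed_change)]
    if not allowed_region(candidate):
        return values, 0.0                      # change rejected, nothing paid
    cost = penalty_rate * sum(abs(d) for d in proposed_change)
    return candidate, cost                      # real change, real cost
```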

This seems worth investigating more; it feels like these considerations need to be built into the synthesis process for this to work, rather than expecting the synthesis process to deliver them on its own (thus this kind of preference is one of the “global meta-preferences about the outcome of the synthesis process”).

4.7 Other issues not addressed

Th­ese are other im­por­tant is­sues that need to be solved to get a fully friendly AI, even if the re­search agenda works perfectly. They are, how­ever, be­yond the scope of this agenda; a par­tial list of these is:

  1. Ac­tu­ally build­ing the AI it­self (left as an ex­er­cise to the reader).

  2. Pop­u­la­tion ethics (though some sort of av­er­age of in­di­vi­d­ual hu­man pop­u­la­tion ethics might be doable with these meth­ods).

  3. Tak­ing into ac­count other fac­tors than in­di­vi­d­ual prefer­ences.

  4. Is­sues of on­tol­ogy and on­tol­ogy changes.

  5. Mind crime (con­scious suffer­ing be­ings simu­lated within an AI sys­tem), though some of the work on iden­tity prefer­ences may help in iden­ti­fy­ing con­scious minds.

  6. In­finite ethics.

  7. Defi­ni­tions of coun­ter­fac­tu­als or which de­ci­sion the­ory to use.

  8. Agent foun­da­tions, log­i­cal un­cer­tainty, how to keep a util­ity sta­ble.

  9. Acausal trade.

  10. Op­ti­mi­sa­tion dae­mons/​in­ner op­ti­misers/​emer­gent op­ti­mi­sa­tion.

Note that the Ma­chine In­tel­li­gence Re­search In­sti­tute is work­ing heav­ily on is­sues 7, 8, and 9.


  1. A par­tial prefer­ence be­ing a prefer­ence where the hu­man con­sid­ers only a small part of the vari­ables de­scribing the uni­verse; see Sec­tion 1.1. ↩︎

  2. Ac­tu­ally, this spe­cific prob­lem is not in­cluded di­rectly in the re­search agenda, though see Sec­tion 4.3. ↩︎

  3. Likely but not cer­tain: we don’t know how effec­tive AIs might be­come at com­put­ing coun­ter­fac­tu­als or mod­el­ling hu­mans. ↩︎

  4. It makes sense to allow partial preferences to contrast a small number of situations, rather than just two. So “when it comes to watching superhero movies, I’d prefer to watch them with Alan, but Beth will do, and definitely not with Carol”. Since partial preferences over n situations can be built out of a number of partial preferences over two situations, allowing more situations is a useful practical move, but doesn’t change the theory (a sketch of this reduction appears after these notes). ↩︎

  5. “One-step” refers to hypotheticals that can be removed from the human’s immediate experience (“Imagine that you and your family are in space...”), but not very far removed (so no lengthy descriptions that could sway the human’s opinions merely by being heard). ↩︎

  6. Equiv­a­lently to re­duc­ing the weight, we could in­crease un­cer­tainty about the par­tial prefer­ence, given the un­fa­mil­iar­ity. There are many op­tions for for­mal­isms that lead to the same out­come. Though note that here, we are im­pos­ing a penalty (low weight/​high un­cer­tainty) for un­fa­mil­iar­ity, whereas the ac­tual hu­man might have in­cred­ibly strong in­ter­nal cer­tainty in their prefer­ences. It’s im­por­tant to dis­t­in­guish as­sump­tions that the syn­the­sis pro­cess makes, from as­sump­tions that the hu­man might make. ↩︎

  7. Extreme situations are also situations where we have to be very careful to ensure the AI has the right model of all preference possibilities. The flaws of an incorrect model can be corrected by enough data; but when data is sparse and unreliable, model assumptions (including the prior) tend to dominate the result. ↩︎

  8. “Nat­u­ral” does not, of course, mean any of “healthy”, “tra­di­tional”, or “non-pol­lut­ing”. How­ever those us­ing the term “nat­u­ral” are of­ten as­sum­ing all of those. ↩︎

  9. The human’s meta-preferences are also relevant here. It might be that, whenever asked about this particular contradiction, the human would answer one way. Therefore H’s conditional meta-preferences may contain ways of resolving these contradictions, at least if the meta-preferences have high weight and the preferences have low weight.

    Conditional meta-preferences can be tricky, though, as we don’t want them to allow the synthesis to get around the one-step hypotheticals restriction. An “if a long theory sounds convincing to me, I want to believe it” meta-preference would, in practice, do away with these restrictions. That particular meta-preference might be cancelled out by the ability of many different theories to sound convincing. ↩︎

  10. We can al­low meta-prefer­ences to de­ter­mine a lot more of their own syn­the­sis if we find an ap­pro­pri­ate method that a) always reaches a syn­the­sis, and b) doesn’t ar­tifi­cially boost some prefer­ences through a feed­back effect. ↩︎
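
To make footnote 4 concrete, here is a minimal sketch of the reduction, assuming a partial preference over several situations is represented as a ranking from most to least preferred (that representation is an assumption of the example, not part of the agenda):

```python
from itertools import combinations
from typing import List, Tuple

def to_pairwise(ranked_situations: List[str]) -> List[Tuple[str, str]]:
    """Expand a partial preference over several situations, listed from most to
    least preferred, into two-situation partial preferences.  Each returned pair
    (a, b) means 'a is preferred to b', all else being equal."""
    return [(a, b) for a, b in combinations(ranked_situations, 2)]

# "I'd prefer to watch superhero movies with Alan, but Beth will do,
#  and definitely not with Carol":
pairs = to_pairwise(["with Alan", "with Beth", "with Carol"])
# -> [("with Alan", "with Beth"), ("with Alan", "with Carol"),
#     ("with Beth", "with Carol")]
```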