Thoughts on Human Models

Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.

In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.

Problems with Human Models

To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences, such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)

Less Independent Audits

Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design?

We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)

Humans have a preference for not being wrong about the quality of a design, let alone being fooled about it. How much do we want to rely on having correctly captured these preferences in our system? If the system is modelling humans, we strongly rely on the system learning and satisfying these preferences, or else we expect to be fooled to the extent that a good-looking but actually bad transit system design is easier to compose than an actually-good design. On the other hand, if the system is not modelling humans, then the fact that its output looks like a good design is better evidence that it is in fact a good design. Intuitively, if we consider sampling possible outputs and condition on the output looking good (via knowledge of humans), the probability of it being good (via knowledge of the domain) is higher when the system’s knowledge is more about what is good than what looks good.
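This intuition can be made concrete with a minimal toy simulation (our own illustrative sketch, not part of the original argument): each candidate design has a true quality drawn from a standard normal, and its apparent quality to an auditor is that true quality plus independent noise. A "domain optimiser" picks the candidate with the best true quality; a "judge optimiser" picks the candidate that looks best to the audit.

```python
import random

def simulate(n_candidates=1000, n_trials=500, seed=0):
    """Compare selecting by true quality vs. by apparent (audited) quality.

    Each candidate has true quality t ~ N(0, 1); its apparent quality to
    the audit is t plus independent noise e ~ N(0, 1).
    """
    rng = random.Random(seed)
    domain_true = judge_true = judge_apparent = 0.0
    for _ in range(n_trials):
        cands = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_candidates)]
        # Domain optimiser: knows what is good, picks the truly best candidate.
        domain_true += max(t for t, _ in cands)
        # Judge optimiser: knows what looks good, picks the best-looking one.
        a_sel, t_sel = max((t + e, t) for t, e in cands)
        judge_apparent += a_sel
        judge_true += t_sel
    return (domain_true / n_trials,
            judge_true / n_trials,
            judge_apparent / n_trials)

domain_true, judge_true, judge_apparent = simulate()
print(f"domain optimiser, true quality of its pick: {domain_true:.2f}")
print(f"judge optimiser,  true quality of its pick: {judge_true:.2f}")
print(f"judge optimiser,  audit score of its pick:  {judge_apparent:.2f}")
```

In this toy model the judge optimiser's pick scores higher on the audit than the domain optimiser's, yet is actually worse: conditioned on a high audit score, the output of the system that optimises against the audit is systematically overrated, while the domain optimiser's output looks roughly as good as it is.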

Here is a handle for this problem: a desire for an independent audit of the system’s outputs. When a system uses human modelling, the mutual information between its outputs and the auditing process (human judgement) is higher. Thus, using human models reduces our ability to do independent audits.

Avoiding human models does not avoid this problem altogether. There is still an “outer-loop optimisation” version of the problem. If the system produces a weird or flawed design in sandbox, and we identify this during an audit, we will probably reject the solution and attempt to debug the system that produced it. This introduces a bias on the overall process (involving multiple versions of the system over phases of auditing and debugging) towards outputs that fool our auditing procedure.

However, outer-loop optimisation pressures are weaker, and therefore less worrying, than in-loop optimisation pressures. We would argue that the problem is much worse, i.e., the bias towards fooling is stronger, when one uses human modelling. This is because the relevant optimisation is then in-loop and is encountered more often.
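To quantify this, consider a toy model (an illustrative sketch of ours): a candidate's apparent quality to an audit is its true quality plus independent noise, and an optimiser keeps the best-looking of N candidates. The amount by which the winner looks better than it actually is grows with N, i.e., with the amount of selection applied against the auditing procedure. A handful of outer-loop audit-and-debug cycles corresponds to small N; in-loop optimisation against a human model corresponds to very large N.

```python
import random

def apparent_true_gap(n_candidates, n_trials=500, seed=0):
    """Average (apparent - true) quality gap of the best-looking of N candidates.

    Apparent quality = true quality + independent audit noise, both standard
    normal. Selecting harder on the proxy widens the gap between how good
    the winner looks and how good it actually is.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        best_apparent, best_true = max(
            ((t := rng.gauss(0, 1)) + rng.gauss(0, 1), t)
            for _ in range(n_candidates)
        )
        total_gap += best_apparent - best_true
    return total_gap / n_trials

# More selection applied against the audit -> the winner is more overrated.
for n in (2, 16, 256, 2048):
    print(f"best-looking of {n:>4}: overrated by {apparent_true_gap(n):.2f}")
```

The gap rises steadily as N grows, which is the sense in which weak, infrequent outer-loop selection is less worrying than strong in-loop selection: both push in the same direction, but the latter applies far more bits of optimisation towards fooling the audit.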

As one more analogy to illustrate this point, consider a classic Goodhart’s law example of teaching to the test. If you study the material, then take a test, your test score reveals your knowledge of the material fairly well. If you instead study past tests, your test score reveals your ability to pass tests, which may be correlated with your knowledge of the material but is increasingly less likely to be so correlated as your score goes up. Here human modelling is analogous to past tests and actual preferences are analogous to the material. Taking the test is analogous to an audit, which we want to be independent from the study regimen.

Risk from Bugs

We might implement our first AGI system incorrectly in a mundane sense. Specifically, even if we fully develop a theory of safe or aligned AGI, we might fail to implement that theory due to bugs or problems with our implementation techniques. In this case, we would be relatively better off if the mutual information between the AGI’s knowledge and human preferences is low. We expect the system’s behaviour to depend on its knowledge in some way, and we expect implementation errors to shift the nature of that dependence away from our intentions and expectations. Incorrect behaviour that depends on human preferences seems more dangerous than incorrect behaviour that does not.

Consider the space of AGI system implementations, under a metric like similarity to an intended design (equivalently: severity of deviation from the design due to bugs). We want all the points near the first AGI system we build to be safe, because we may end up with a slightly different design than intended, for reasons such as being confused about what we are doing or making implementation errors.

There are at least three ways in which the risk from bugs can manifest.

Incorrectly Encoded Values

Suppose we intend the first use of AGI to be solving some bounded and well-specified task, but we misunderstand or implement it so badly that what we end up with actually unboundedly optimises some objective function. Then it seems better if that objective is something abstract, like puzzle solving, rather than something more directly connected to human preferences: consider, as a toy example, the consequences if the sign (positive/negative) on the objective were wrong.


Manipulation

The earlier arguments for independent audits do not just apply to the specific tasks we would plan to audit, but also to any activities an AGI system might carry out that humans might disapprove of. Examples include finding ways to hack into our supposedly secure systems, hiding its intentions and activity from us, or outright manipulating us. These tasks are much easier with access to a good psychological model of humans, which can be used to infer what mistakes we might make, or what loopholes we might overlook, or how we might respond to different behaviour from the system.

Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.


Threats

Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system. More examples and perspectives on this problem have been described here.

The increased risk from bugs due to human modelling can be summarised as follows: whatever the risk that AGI systems produce catastrophic outcomes due to bugs, the very worst outcomes seem more likely if the system was trained using human modelling, because these worst outcomes depend on the information in human models.

Less independent audits and the risk from bugs can both be mitigated by preserving independence of the system from human model information, so the system cannot overfit to that information or use it perversely. The remaining two problems we consider, mind crime and unexpected agents, depend more heavily on the claim that modelling human preferences increases the chances of simulating something human-like.

Mind Crime

Many computations may produce entities that are morally relevant because, for example, they constitute sentient beings that experience pain or pleasure. Bostrom calls improper treatment of such entities “mind crime”. Modelling humans in some form seems more likely to result in such a computation than not modelling them, since humans are morally relevant and the system’s models of humans may end up sharing whatever properties make humans morally relevant.

Unexpected Agents

Similar to the mind crime point above, we expect AGI designs that use human modelling to be more at risk of producing subsystems that are agent-like, because humans are agent-like. For example, we note that trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners. A system engineered to predict human preferences well seems strictly more likely to run into problems associated with misaligned sub-agents. (Nevertheless, we think the amount by which it is more likely is small.)

Safe AGI Without Human Models is Neglected

Given the independent auditing concern, plus the additional points mentioned above, we would like to see more work done on practical approaches to developing safe AGI systems that do not depend on human modelling. At present, this is a neglected area in the AGI safety research landscape. Specifically, work of the form “Here’s a proposed approach, here are the next steps to try it out or investigate further”, which we might term engineering-focused research, is almost entirely done in a human-modelling context. Where we do see some safety work that eschews human modelling, it tends to be theory-focused research, for example, MIRI’s work on agent foundations. This does not fill the gap of engineering-focused work on safety without human models.

To flesh out the claim of a gap, consider the usual formulations of each of the following efforts within safety research: iterated distillation and amplification, debate, recursive reward modelling, cooperative inverse reinforcement learning, and value learning. In each case, there is human modelling built into the basic setup for the approach. However, we note that the technical results in these areas may in some cases be transportable to a setup without human modelling, if the source of human feedback (etc.) is replaced with a purely algorithmic, independent system.

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment. Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling. Nevertheless, we would like to see more work of all these kinds, as well as new techniques for building safe AGI that does not rely on human modelling.

Difficulties in Avoiding Human Models

A plausible reason why we do not yet see much research on how to build safe AGI without human modelling is that it is difficult. In this section, we describe some distinct ways in which it is difficult.


Usefulness

It is not obvious how to put a system that does not do human modelling to good use. At least, it is not as obvious as for systems that do human modelling, since they draw directly on sources of information about useful behaviour (e.g., human preferences). In other words, it is unclear how to solve the specification problem (how to correctly specify desired, and only desired, behaviour in complex domains) without human modelling. The “against human modelling” stance calls for a solution to the specification problem wherein useful tasks are transformed into well-specified, human-independent tasks either solely by humans or by systems that do not model humans.

To illustrate, suppose we have solved some well-specified, complex but human-independent task like theorem proving or atomically precise manufacturing. Then how do we leverage this solution to produce a good (or better) future? Empowering everyone, or even a few people, with access to a superintelligent system that does not directly encode their values in some way does not obviously produce a future where those values are realised. (This seems related to Wei Dai’s human-safety problem.)

Implicit Human Models

Even seemingly “independent” tasks leak at least a little information about their origins in human motivations. Consider again the mass transit system design problem. Since the problem itself concerns the design of a system for use by humans, it seems difficult to avoid modelling humans at all in specifying the task. More subtly, even highly abstract or generic tasks like puzzle solving contain information about the sources/designers of the puzzles, especially if they are tuned for encoding more obviously human-centred problems. (Work by Shah et al. looks at using the information about human preferences that is latent in the world.)

Specification Competitiveness / Do What I Mean

Explicit specification of a task in the form of, say, an optimisation objective (of which a reinforcement learning problem would be a specific case) is known to be fragile: there are usually things we care about that get left out of explicit specifications. This is one of the motivations for seeking ever more high-level and indirect specifications, leaving more of the work of figuring out what exactly is to be done to the machine. However, it is currently hard to see how to automate the process of turning vaguely defined tasks into correct specifications without modelling humans.

Performance Competitiveness of Human Models

It could be that modelling humans is the best way to achieve good performance on various tasks we want to apply AGI systems to, for reasons that are not simply to do with understanding the problem specification well. For example, there may be aspects of human cognition that we want to more or less replicate in an AGI system, for competitiveness at automating those cognitive functions, and those aspects may carry a lot of information about human preferences with them in a way that is hard to separate out.

What to Do Without Human Models?

We have seen arguments for and against aspiring to solve AGI safety using human modelling. Looking back on these arguments, we note that to the extent that human modelling is a good idea, it is important to do it very well; to the extent that it is a bad idea, it is best to not do it at all. Thus, whether or not to do human modelling at all is a configuration bit that should probably be set early when conceiving of an approach to building safe AGI.

It should be noted that the arguments above are not intended to be decisive, and there may be countervailing considerations which mean we should promote the use of human models despite the risks outlined in this post. However, to the extent that AGI systems with human models are more dangerous than those without, there are two broad lines of intervention we might attempt. Firstly, it may be worthwhile to try to decrease the probability that advanced AI develops human models “by default”, by promoting some lines of research over others. For example, an AI trained in a procedurally-generated virtual environment seems significantly less likely to develop human models than an AI trained on human-generated text and video data.

Secondly, we can focus on safety research that does not require human models, so that if we eventually build AGI systems that are highly capable without using human models, we can make them safer without needing to teach them to model humans. Examples of such research, some of which we mentioned earlier, include developing human-independent methods to measure negative side effects, to prevent specification gaming, to build secure approaches to containment, and to extend the usefulness of task-focused systems.

Acknowledgements: thanks to Daniel Kokotajlo, Rob Bensinger, Richard Ngo, Jan Leike, and Tim Genewein for helpful comments on drafts of this post.