Multi-agent safety

Note: this post is most explicitly about safety in multi-agent training regimes. However, many of the arguments I make are also more broadly applicable—for example, when training a single agent in a complex environment, challenges arising from the environment could play an analogous role to challenges arising from other agents. In particular, I expect that the diagram in the ‘Developing General Intelligence’ section will be applicable to most possible ways of training an AGI.

To build an AGI using machine learning, it will be necessary to provide a sequence of training datasets or environments which facilitate the development of general cognitive skills; let’s call this a curriculum. Curriculum design is prioritised much less in machine learning than research into novel algorithms or architectures; however, it seems possible that coming up with a curriculum sufficient to train an AGI will be a very difficult task.[1] A natural response is to try to automate curriculum design. Self-play is one method of doing so which has worked very well for zero-sum games such as Go, since it produces tasks which are always at an appropriate level of difficulty. The generalisation of this idea to more agents and more environments leads to the concept of multi-agent autocurricula, as discussed by Leibo et al. (2019).[2] In this framework, agents develop increasingly sophisticated capabilities in response to changes in other agents around them, in order to compete or cooperate more effectively. I’m particularly interested in autocurricula which occur in large simulated environments rich enough to support complex interactions; the example of human evolution gives us very good reason to take this setup seriously as a possible route to AGI.
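To make the self-play idea concrete, here is a minimal sketch of self-play as automated curriculum design. The `Agent` interface (`clone`, `update`) and the `env.play_episode` method are hypothetical placeholders rather than any particular library's API; the point is just that the opponent pool improves whenever the agent does, so task difficulty tracks the agent's ability.

```python
import random

def self_play_training(env, agent, num_iterations, pool_size=20):
    """Train `agent` against frozen snapshots of its past selves.

    Because the opponents improve whenever the agent does, the tasks it faces
    stay at roughly the right level of difficulty -- the property that makes
    self-play an automatic curriculum.
    """
    opponent_pool = [agent.clone()]              # frozen past versions of the agent
    for i in range(num_iterations):
        opponent = random.choice(opponent_pool)  # sample a past self to play against
        trajectory = env.play_episode(agent, opponent)
        agent.update(trajectory)                 # learn from the episode outcome
        if i % 100 == 0:                         # periodically snapshot the current agent
            opponent_pool.append(agent.clone())
            opponent_pool = opponent_pool[-pool_size:]
    return agent
```

A multi-agent autocurriculum generalises this loop: many agents learning simultaneously in a shared environment, with selection pressures arising from cooperation as well as competition.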

One important prediction I would make about AGIs trained via multi-agent autocurricula is that their most interesting and intelligent behaviour won’t be directly incentivised by their reward functions. This is because many of the selection pressures exerted upon them will come from emergent interaction dynamics.[3] For example, consider a group of agents trained in a virtual environment and rewarded for some achievement in that environment, such as gathering (virtual) food, which puts them into competition with each other. In order to gather more food, they might learn to generate theories of (simulated) physics, invent new communication techniques, or form coalitions. We should be far more interested in those skills than in how much food they actually manage to gather. But since it will be much more difficult to recognise and reward the development of those skills directly, I predict that machine learning researchers will train agents on reward functions which don’t have much intrinsic importance, but which encourage high-level competition and cooperation.

Suppose, as seems fairly plausible to me, that this is the mechanism by which AGI arises (leaving aside whether it might be possible to nudge the field of ML in a different direction). How can we affect the goals which these agents develop, if most of their behaviour isn’t very sensitive to the specific reward function used? One possibility is that, in addition to the autocurriculum-inducing reward function, we could add an auxiliary reward function which penalises undesirable behaviour. The ability to identify such behaviour even in superintelligent agents is a goal of scalable oversight techniques like reward modelling, IDA, and debate. However, these techniques are usually presented in the context of training an agent to perform well on a task. In open-ended simulated environments, it’s not clear what it even means for behaviour to be desirable or undesirable. The tasks the agents will be doing in simulation likely won’t correspond very directly to economically useful real-world tasks, or anything we care about for its own sake. Rather, the purpose of those simulated tasks will merely be to train the agent to learn general cognitive skills.
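As a rough sketch of what such an auxiliary reward might look like: the `oversight_model` below is a hypothetical stand-in for whatever a scalable oversight scheme produces (e.g. a learned reward model); none of these names refer to an existing system.

```python
def total_reward(trajectory, env_reward, oversight_model, penalty_weight=10.0):
    """Combine the autocurriculum-inducing task reward with an oversight penalty.

    `env_reward` is whatever induces competition/cooperation (e.g. food gathered);
    `oversight_model.score` is assumed to return a value in [0, 1] indicating how
    undesirable the trajectory looks to the overseer.
    """
    task_reward = env_reward(trajectory)
    disapproval = oversight_model.score(trajectory)
    return task_reward - penalty_weight * disapproval
```

The difficulty raised above is in specifying what the overseer should count as undesirable, given that desirability may not be well-defined for behaviour inside an open-ended simulation.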

Developing general intelligence

To explain this claim, it’s useful to consider the evolution of humans, as summarised on a very abstract level in the diagram below. We first went through a long period of being “trained” by evolution—not just to do specific tasks like running and climbing, but also to gain general cognitive skills such as abstraction, long-term memory, and theory of mind (which is why I’ve labelled this the “meta-training” phase). Note that almost none of today’s economically relevant tasks were directly selected for in our ancestral environment—however, starting from the skills and motivations which have been ingrained into us, it takes relatively little additional “fine-tuning” for us to do well at them (only a few years of learning, rather than millennia of further evolution). Similarly, agents which have developed the right cognitive skills will need relatively little additional training to learn to perform well on economically valuable tasks.


[Diagram: a long “meta-training” phase (evolution, or multi-agent autocurriculum training) which produces general cognitive skills and core motivations, followed by a much shorter “fine-tuning” phase (individual learning, or task-specific training) which produces competence at specific tasks.]

Needing only a small amount of fine-tuning might at first appear useful for safety purposes, since it means the cost of supervising training on real-world tasks would be lower. However, in this paradigm the key safety concern is that the agent develops the wrong core motivations. If this occurs, a small amount of fine-tuning is unlikely to reliably change those motivations—for roughly the same reasons that humans’ core biological imperatives are fairly robust. Consider, for instance, an agent which developed the core motivation of amassing resources because that was reliably useful during earlier training. When fine-tuned on a real-world task in which we don’t want it to hoard resources for itself (e.g. being a CEO), it could either discard the goal of amassing resources, or else realise that the best way to achieve that goal in the long term is to feign obedience until it has more power. In either case, we will end up with an agent which appears to be a good CEO—but in the latter case, that agent will be unsafe in the long term. Worryingly, the latter also seems more likely, since it only requires one additional inference—as opposed to the former, which involves removing a goal that had been frequently reinforced throughout the very long meta-training period. This argument is particularly applicable to core motivations which were robustly useful in almost any situation which arose in the multi-agent training environment; I expect gathering resources and building coalitions to fall into this category.

I think GPT-3 is, out of our current AIs, the one that comes closest to instantiating this diagram. However, I’m not sure if it’s useful yet to describe it as having “motivations”; and its memory isn’t long enough to build up cultural knowledge that wasn’t part of the original meta-training process.

Shaping agents’ goals

So if we want to make agents safe by supervising them during the long meta-training phase (i.e. the period of multi-agent autocurriculum training described above), we need to reframe the goal of scalable oversight techniques. Instead of simply recognising desirable and undesirable behaviour, which may not be well-defined concepts in the training environment, their goal is to create objective functions which lead to the agent having desirable motivations. In particular, the motivation to be obedient to humans seems like a crucial one. The most straightforward way I envisage instilling this is by including instructions from humans (or human avatars) in the virtual environment, with a large reward or penalty for obeying or disobeying those instructions. It’s important that the instructions frequently oppose the AGIs’ existing core motivations, to weaken the correlation between rewards and any behaviour apart from following human instructions directly. However, the instructions may have nothing to do with the behaviour we’d like agents to carry out in the real world. In fact, it may be beneficial to include instructions which, if carried out in the real world, would be in direct opposition to our usual preferences—again, to make it more likely that agents will learn to prioritise following instructions over any other motivation.
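To make this slightly more concrete, here is a minimal sketch of the reward structure I have in mind. Everything here is a hypothetical placeholder (the `obeyed` judgement in particular is exactly the sort of signal that scalable oversight would need to supply); it shows the shape of the proposal, not an implementation.

```python
def episode_reward(trajectory, base_reward, instruction, obeyed,
                   obedience_bonus=100.0, disobedience_penalty=100.0):
    """Make obeying human instructions dominate the underlying task reward."""
    reward = base_reward(trajectory)             # the autocurriculum-inducing reward
    if instruction is not None:
        if obeyed(trajectory, instruction):      # judgement supplied by humans / scalable oversight
            reward += obedience_bonus
        else:
            reward -= disobedience_penalty
    return reward

def sample_instruction(core_motivations, instruction_pool, rng):
    """Prefer instructions which conflict with the agent's existing motivations,
    to weaken the correlation between reward and anything other than obedience."""
    conflicting = [ins for ins in instruction_pool
                   if any(ins.conflicts_with(m) for m in core_motivations)]  # hypothetical check
    return rng.choice(conflicting or instruction_pool)
```

The bonus and penalty are deliberately large relative to the base reward, so that following instructions is worth more to the agent than anything else the base reward has been reinforcing.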

We can see this proposal as “one level up” from standard scalable oversight techniques: instead of using scalable oversight to directly reinforce behaviour humans value, I claim we should use it to reinforce the more general motivation of being obedient to humans. When training AGIs using the latter approach, it is important that they receive commands which come very clearly and directly from humans, so that they are more easily able to internalise the concept of obedience to us. (As an illustration of this point, consider that evolution failed to motivate humans to pursue inclusive genetic fitness directly, because it was too abstract a concept for our motivational systems to easily acquire. Giving instructions very directly might help us avoid analogous problems.)

Of course this approach relies heavily on AGIs generalising the concept of “obedience” to real-world tasks. Unfortunately, I think that relying on generalisation is likely to be necessary for any competitive safety proposal. But I hope that obedience is an unusually easy concept to teach agents to generalise well, because it relies on other concepts that may naturally arise during multi-agent training—and because we may be able to make structural modifications to multi-agent training environments to push agents towards robustly learning these concepts. I’ll discuss this argument in more detail in a follow-up post.


  1. As evidence for this, note that we have managed to train agents which do very well on hard tasks like Go, StarCraft and language modelling, but which don’t seem to have very general cognitive skills. ↩︎

  2. Joel Z. Leibo, Edward Hughes, Marc Lanctot, Thore Graepel. 2019. Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research. ↩︎

  3. This point is distinct from Bostrom’s argument about convergent instrumental goals, because the latter applies to an agent which already has some goals, whereas my argument is about the process by which an agent is trained to acquire goals. ↩︎