How the MtG Color Wheel Explains AI Safety

Duncan Sabien has a post titled How the ‘Magic: The Gathering’ Color Wheel Explains Humanity. Without the context of that post, or other experience with the MtG color wheel, this post will probably not make sense. This post may not make sense anyway. I will use a type of analysis that is sometimes used to talk about humans (and often criticized even when used for humans), but rarely used in any technical subjects. I will abstract so far that everything will start to look like (almost) everything else. I will use wrong categories and stretch facts to make them look like they fit into my ontology.

I will describe 5 clusters of ideas in AI and AI safety, which correspond to the 5 Magic: The Gathering colors. Each color will also come along with a failure mode. For each failure mode, the two opposing colors (on the opposite side of the pentagon) form a collection of tools and properties that might be useful for fighting that failure mode.

Mutation and Selection

So I want to make an AI that can accomplish some difficult task without trying to kill me (or at least without succeeding in killing me). Let’s consider the toy task of designing a rocket. First, I need a good metric of what it means to be a good rocket design. Then, I need to search over the space of potential rocket designs, and find one that scores well according to my metric. I claim that search is made of two pieces: Mutation and Selection, Exploration and Optimization, or Babble and Prune.

Mutation and Selection are often thought of as components of the process of evolution. Genes spin off slightly modified copies over time through mutation, and selection repeatedly throws out the genes that score badly according to a fitness metric (that is itself changing over time). The result is that you find genes that are very fit for survival.

However, I claim that mutation and selection are much more general than evolution. Gradient descent is very close to (a speedup of) the following process. Take an initial point called your current best point. Sample a large number of points within an epsilon ball of the current best point. Select the best of the sampled points according to some metric. Call the selected point the new current best point, and repeat.
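The process above is simple enough to sketch directly. Here is a minimal, illustrative implementation (the function names and parameters are my own, not from any library): each step babbles candidates in an epsilon ball around the current best point, then prunes down to the highest-scoring one.

```python
import random

def sample_and_select(score, x0, epsilon=0.1, n_samples=100, n_steps=200):
    """Naive 1-D search: mutate (sample nearby points), then select (keep the best)."""
    best = x0
    for _ in range(n_steps):
        # Mutation: sample candidates within an epsilon ball of the current best.
        candidates = [best + random.uniform(-epsilon, epsilon) for _ in range(n_samples)]
        # Selection: keep whichever point scores highest (including the incumbent).
        best = max(candidates + [best], key=score)
    return best

# Example: climb toward the maximum of a simple concave metric, peaked at x = 3.
peak = sample_and_select(lambda x: -(x - 3.0) ** 2, x0=0.0)
```

Each iteration can only move by about epsilon, which is roughly why this behaves like a (slow) version of gradient descent: locally, the best sampled point lies in the direction of steepest ascent.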

I am not trying to claim that machine learning is simple because it is mostly just mutation and selection. Rather, I am trying to claim that many of the complexities of machine learning can be viewed as trying to figure out how to do mutation and selection well.

Goodharting is a problem that arises when extreme optimization goes unchecked, especially when the optimization is much stronger than the process that chose the proxy being optimized for.
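A toy illustration of that dynamic (my own construction, not a standard benchmark): a proxy that tracks the true value well on the unoptimized range of outcomes can become anti-correlated with it once optimization pushes into extreme outcomes.

```python
import random

def true_value(x):
    # What we actually care about: peaks at x = 1, falls off beyond it.
    return x - 0.5 * x ** 2

def proxy(x):
    # A proxy chosen because it tracks true_value well on the typical range [0, 1].
    return x

# Mild optimization over typical outcomes: the proxy's winner is genuinely good.
typical = [random.uniform(0, 1) for _ in range(1000)]
mild_pick = max(typical, key=proxy)

# Strong optimization over a much wider range: the proxy's winner is an
# extreme point where the true value has collapsed (Goodhart).
extreme = [random.uniform(0, 10) for _ in range(1000)]
hard_pick = max(extreme, key=proxy)
```

The proxy was only ever validated on the unoptimized distribution, so the stronger search wanders out of the region where that validation means anything.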

Similarly, unchecked exploration can also lead to problems. This is especially true for systems that are very powerful, and can take irreversible actions that have not been sufficiently optimized. This could show up as a personal robot accidentally killing a human user when it gets confused, or as a powerful agent exploring into actions that destroy it. I will refer to this problem as irreversible exploration.


The process described above is how I want to find my rocket design. The problem is that this search is not stable or robust to scale. It is setting up an internal pressure for consequentialism, and if that consequentialism is realized, it might interfere with the integrity of the search.

By consequentialism, I am not talking about the moral framework. It is similar to the moral framework, but it should not have any connotations of morality. Instead, I am talking about the process of reasoning about the consequences of potential actions, and choosing actions based on those consequences. Other phrases I may use to describe this process include agency, doing things on purpose, and back-chaining.

Let’s go back to the evolution analogy. Many tools have evolved to perform many subgoals of survival, but one of the most influential tools to evolve was the mind. The reason the mind was so useful was that it was able to optimize on a tighter feedback cycle than evolution itself. Instead of using genes that encode different strategies for gathering food and keeping the ones that work, the mind can reason about different strategies for gathering food, try things, see which ways work, and generalize across domains, all within a single generation. The best way for evolution to gather food is to create a process that uses a feedback loop that is unavailable to evolution directly to improve the food gathering process. This process is implemented using a goal. The mind has a goal of gathering food. The outer evolution process need not learn by trial and error. It can just choose the minds, and let the minds gather the food. This is more efficient, so it wins.

It is worth noting that gathering food might only be a subgoal for evolution, and it could still be locally worthwhile to create minds that reason terminally about gathering food. In fact, reasoning about gathering food might be more efficient than reasoning about genetic fitness.

Mutation and selection together form an outer search process that finds things that score well according to some metric. Consequentialism is a generic way to score well on any given metric: choose actions on purpose that score well according to that metric (or a similar metric). It is hard to draw the line between things that score well by accident, and things that score well on purpose, so when we try to search over things that score well by accident, we find things that score well on purpose. Note that the consequentialism might itself be built out of a mutation and selection process on a different meta level, but the point is that it is searching over things to choose between using a score that represents the consequences of choosing those things. From the point of view of the outer search process, it will just look like a thing that scores well.

So a naive search trying to solve a hard problem may find things that are themselves using consequentialism. This is a problem for my rocket design task, because I was trying to be the consequentialist, and I was trying to just use the search as a tool to accomplish my goal of getting to the moon. When I create consequentialism without being very careful to ensure it is pointed in the same direction as I am, I create a conflict. This is a conflict that I might lose, and that is the problem. I will refer to consequentialism arising within a powerful search process as inner optimizers or daemons.

Boxing and Mildness

This is where AI safety comes in. Note that this is a descriptive analysis of AI safety, and not necessarily a prescriptive one. Some approaches to AI safety attempt to combat daemons and irreversible exploration through structure and restrictions. The central example in this cluster is AI boxing. We put the AI in a box, and if it starts to behave badly, we shut it off. This way, if a daemon comes out of our optimization process, it won’t be able to mess up the outside world. I obviously don’t put too much weight in something like that working, but boxing is a pretty good strategy for dealing with irreversible exploration. If you want to try a thing that may have bad consequences, you can spin up a sandbox inside your head that is supposed to model the real world, you can try the thing in the sandbox, and if it messes things up in your sandbox, don’t try it in the real world. I think this is actually a large part of how we can learn in the real world without bad consequences. (This is actually a combination of boxing and selection together fighting against irreversible exploration.)
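The sandbox-before-reality pattern can be sketched in a few lines. This is a toy model of the idea, with invented names throughout (`try_safely`, `is_catastrophe`, and the dict-based "worlds" are all assumptions for illustration): rehearse the action on a disposable copy of your world model, and only run it for real if the copy survives.

```python
def try_safely(action, world_model, real_world, is_catastrophe):
    """Boxing against irreversible exploration: rehearse an action in a
    sandboxed copy of the world model before committing to it for real."""
    sandbox = dict(world_model)       # cheap, disposable copy (a dict stands in for a world)
    sandbox = action(sandbox)         # any irreversibility stays inside the box
    if is_catastrophe(sandbox):
        return real_world             # refuse to run the action for real
    return action(real_world)

# Example: an action that would burn more fuel than we have.
def dump_fuel(world):
    world = dict(world)
    world["fuel"] -= 10
    return world

world = {"fuel": 5}
after = try_safely(dump_fuel, world, world, lambda w: w["fuel"] < 0)
```

Here the rehearsal reveals negative fuel, so the real world is returned untouched. The obvious caveat, as the paragraph above notes, is that this only helps insofar as the model actually resembles the real world.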

Other strategies I want to put in this cluster include formal verification, informed oversight, and factorization. By factorization, I am talking about things like factored cognition and comprehensive AI services. In both cases, problems are broken up into small pieces by a trusted system, and the small pieces accomplish small tasks. This way, you never have to run any large untrusted evolution-like search, and don’t have to worry about daemons.

The main problem with things in this cluster is that they likely won’t work. However, if I imagine they worked too well, and I had a system that actually had these types of restrictions making it safe throughout, there is still a (benign) failure mode which I will refer to as lack of algorithmic range. By this, I mean things like making a system that is not Turing complete, and so can’t solve some hard problems, or a prior that is not rich enough to contain the true world.

Mildness is another cluster of approaches in AI safety, which is used to combat Daemons and Goodhart. Approaches in this cluster include Mild Optimization, Impact Measures, and Corrigibility. They are all based on the fact that the world is already partially optimized for our values (or vice versa), and too much optimization can destroy that.

A central example of this is quantilization, which is a type of mild optimization. We have a proxy which was observed to be good in the prior, unoptimized distribution of possible outcomes. If we then optimize the outcome according to that proxy, we will go to a single point with a high proxy value. There is no guarantee that that point will be good according to the true value. With quantilization, we instead do something like the following: restrict attention to the top one percent of possible outcomes according to the proxy, and choose a point at random from among them, weighted by the unoptimized distribution. This allows us to transfer some guarantees from the unoptimized distribution to the final outcome.
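For concreteness, here is a minimal sketch of a quantilizer over a finite set of outcomes (the function name and interface are my own; real treatments work with distributions rather than explicit lists): keep the top q fraction by proxy score, then sample from that slice according to the base distribution's weights.

```python
import random

def quantilize(outcomes, proxy, base_weight, q=0.01):
    """Sample from the base distribution, restricted to the top q fraction
    of outcomes as ranked by the proxy (rather than taking the argmax)."""
    ranked = sorted(outcomes, key=proxy, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    weights = [base_weight(o) for o in top]
    return random.choices(top, weights=weights, k=1)[0]

# Example: 1000 outcomes, uniform base distribution, proxy = the outcome itself.
# A pure optimizer would always return 999; the quantilizer returns a random
# member of the top one percent.
pick = quantilize(list(range(1000)), proxy=lambda x: x, base_weight=lambda x: 1.0)
```

Because the pick is drawn from the base distribution (merely conditioned on being in the top slice), any property that holds with high base-distribution probability degrades by at most a factor of 1/q, which is the guarantee-transfer the paragraph above describes.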

Impact measures are similarly only valuable because the do-nothing action is special in that it is observed to be good for humans. Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic. We want systems that are willing to let human operators fix them, in a way that doesn’t result in optimizing the world for being the perfect way to collect large amounts of feedback from microscopic humans. Finally, note that one way to stop a search from creating an optimization daemon is to just not push it too hard.

The main problem with this class of solutions is a lack of competitiveness. It is easy to make a system that doesn’t optimize too hard. Just make a system that doesn’t do anything. The problem is that we want a system that actually does stuff, partially because it needs to keep up with other systems that are growing and doing things.