Concept Safety: What are concepts for, and how to deal with alien concepts

I’m currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of “Concept Safety”-titled articles.

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined either as multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a “conceptual expansion”:

  1. Children learn to represent dimensions such as “height” and “volume”, as well as “big” and “bright”, separately at around age 5.

  2. As an inhabitant of the Earth, you’ve been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner’s control extends. Similarly for satellites and nation-states.

  3. As an inhabitant of Flatland, you’ve been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height to which the forbidden territory extends.

  4. An AI has previously been reasoning in terms of classical physics and been told that it can’t leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allows for definitions of “location” that are substantially different from the classical ones.

As a hint of the direction where I’ll be going, let’s first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example—children realizing that items have a volume that’s separate from their height—rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it’s a non-issue because none of the children’s values have been defined in terms of the affected conceptual domain.

As for the second example, I don’t know the exact cognitive process by which it was decided that you didn’t need the landowner’s permission to fly over their land. But I’m guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn’t harm the landowner in any way. Flying would become impossibly difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are more capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you’re food or are after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can’t just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria—such as being hungry or getting food—are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then “only” needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that’s neither of the above, but we don’t care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.
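As a minimal sketch of this setup (everything here is a made-up toy: the states, the dynamics, and the reward function), the reward function below is a black box that simply scores world-states, while the learner runs tabular Q-learning and only ever sees the rewards that the black box emits:

```python
import random
from collections import defaultdict

# The "reward function" part: a black box that scores world-states.
# (Hypothetical toy environment: states 0..4, where state 4 contains food.)
def reward_function(state):
    return 1.0 if state == 4 else 0.0

def step(state, action):
    """Toy world dynamics: action -1 or +1 moves between states 0..4."""
    return max(0, min(4, state + action))

# The "learner" part: tabular Q-learning, which only sees the rewards emitted
# by the reward function and tries to maximize their discounted long-run sum.
Q = defaultdict(float)          # Q[(state, action)] = estimated future reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = [-1, +1]

for episode in range(500):
    state = 0
    for t in range(20):
        if random.random() < epsilon:
            action = random.choice(actions)                      # explore
        else:
            action = max(actions, key=lambda a: Q[(state, a)])   # exploit
        next_state = step(state, action)
        r = reward_function(next_state)
        # Standard Q-learning update: nudge the estimate towards the
        # observed reward plus the discounted best future estimate.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])
        state = next_state
```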

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

  1. They make learning massively easier. Instead of having to separately learn that “in the world-state where I’m sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them” and that “in the world-state where I’m standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it” and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states (see the sketch after this list).

  2. There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of “face-like things” helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.
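As a sketch of point 1 (with entirely made-up state features and a hypothetical abstract() function), collapsing many concrete world-states into one abstract concept means the agent only needs one policy entry per concept, rather than one per raw state:

```python
# Concrete world-states with lots of irrelevant detail (hypothetical features).
raw_states = [
    {"location": "cave",      "clothed": False, "weather": "dry",  "holding": "berries"},
    {"location": "outside",   "clothed": True,  "weather": "rain", "holding": "fish"},
    {"location": "riverbank", "clothed": True,  "weather": "dry",  "holding": "nothing"},
]

def abstract(state):
    """Map a raw world-state to an abstract concept (here: is food available?)."""
    return "food_available" if state["holding"] in {"berries", "fish"} else "no_food"

# The policy only needs one entry per *concept*, not one per raw state.
policy = {"food_available": "put_in_mouth", "no_food": "search_for_food"}

for s in raw_states:
    print(abstract(s), "->", policy[abstract(s)])
```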

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regard to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2-s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2-s5, and we should expect that concept to be “more similar” to the concept of being in state s1 than to the concept of being in some state that was many actions away.

“More similar” how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you’re in that state or perform some action in that state. Meanwhile, the value of a state is the maximum total reward that you can expect to get by moving from that state to others (times some discount factor). So a state A with reward 0 might have a value of 5 if you could move from it to state B, which had a reward of 5 (ignoring discounting).
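To make the reward/value distinction concrete, here is a minimal sketch (a toy Python example with made-up states, unrelated to the DeepMind results discussed next) that runs value iteration on the s1-s5 situation from above: s2-s5, which reach the rewarding state s1 in one step, all end up with the same value, noticeably closer to s1’s value than a state several actions away.

```python
# Toy deterministic world (hypothetical): s2-s5 each reach s1 in one step,
# s_far takes three steps, and only being in s1 yields reward.
transitions = {
    "s1": "s1", "s2": "s1", "s3": "s1", "s4": "s1", "s5": "s1",
    "s_mid": "s2", "s_far": "s_mid",
}
reward = {s: (1.0 if s == "s1" else 0.0) for s in transitions}
gamma = 0.9  # discount factor

# Value iteration: V(s) = R(s) + gamma * V(next(s))
V = {s: 0.0 for s in transitions}
for _ in range(200):
    V = {s: reward[s] + gamma * V[transitions[s]] for s in transitions}

for s, v in sorted(V.items(), key=lambda kv: -kv[1]):
    print(f"{s}: {v:.2f}")
# s1 ends up near 10.0; s2-s5 all share a value near 9.0; s_far lags at ~7.3.
# The states one step away from the reward cluster together in value.
```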

Below is a figure from DeepMind’s recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have differing values are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.
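In the paper, that figure is a t-SNE embedding of the network’s last hidden layer activations, with each point colored by the predicted value of the corresponding state. A rough sketch of how one might produce that kind of plot (using random stand-in arrays here, since the real activations and values come from a trained agent) could look like this:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical inputs: 'activations' holds the agent's last-hidden-layer
# representation of each game-state, 'values' the predicted value of each state.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))   # stand-in for real network activations
values = rng.uniform(size=1000)              # stand-in for real predicted values

# Embed the high-dimensional representations into 2D and color by value:
# if value shapes the representation, similarly-valued states should cluster.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(activations)

plt.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap="coolwarm", s=5)
plt.colorbar(label="predicted state value")
plt.title("t-SNE of state representations, colored by value")
plt.show()
```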

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn’t seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don’t have to keep doing it all the way to your grave. Eventually they’ll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don’t need to be reinforced anymore (Pryor 2006).
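One very crude way to picture the “modifies their reward function” idea (a toy sketch only; nothing here is meant as a claim about how brains actually implement this) is an agent whose reward function gradually grants an intrinsic bonus to concepts that have repeatedly co-occurred with external reward:

```python
from collections import defaultdict

# Toy model of reward internalization: concepts that repeatedly co-occur with
# external reward gradually acquire an intrinsic reward of their own.
intrinsic_reward = defaultdict(float)
internalization_rate = 0.05

def total_reward(active_concepts, external_reward):
    # Behaviors that were once only externally rewarded slowly become
    # rewarding in themselves.
    for concept in active_concepts:
        intrinsic_reward[concept] += internalization_rate * external_reward
    return external_reward + sum(intrinsic_reward[c] for c in active_concepts)

# Early on, "sharing toys" is only worth the praise it earns...
for _ in range(100):
    total_reward({"sharing_toys"}, external_reward=1.0)

# ...but after enough praise, it carries reward even with no praise at all.
print(total_reward({"sharing_toys"}, external_reward=0.0))  # > 0
```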

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as “rewarding” in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he “wouldn’t be at all surprised” if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn’t so much that he expected this to be the particular mechanism, but that he wouldn’t find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of “don’t be sexually interested in the people you grew up with”.

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let’s take another look at the two examples we discussed above. As a reminder, they were:

  1. Children learn to represent dimensions such as “height” and “volume”, as well as “big” and “bright”, separately at around age 5.

  2. As an inhabitant of the Earth, you’ve been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner’s control extends.

I summarized my first attempt at describing the consequences of #1 as “it’s a non-issue because none of the children’s values have been defined in terms of the affected conceptual domain”. We can now reframe it as “it’s a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that’s now been split into ‘height’ and ‘volume’”.

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn’t, and it was unclear whether it was the “tall glass” or “high-volume glass” concept that needed to be mapped closer to high-value glasses, this could be easily determined by simple experimentation.

As for the airplane example, I summarized my description of it by saying that “after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values”. We can similarly reframe this as “after the feature of ‘height’ suddenly became relevant for the concept of landownership, when it hadn’t been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amount of rewarding things”. “Rewarding things”, here, shouldn’t be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like the right to ownership.

Note also that different people, having different experiences, ended up making different redefinitions. No doubt some landowners felt that “being in total control of my land and everything above it” was a more important value than “the convenience of people who get to use airplanes”… unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of “gay rights”, which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of “my nation having strong gay rights” abstracts over. (Of course, things don’t always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declare the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink the prohibition in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, “location” is not a fundamental or well-defined concept anymore—and in some theories, even the concept of “space” is no longer fundamental or well-defined. What happens?

Let’s look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don’t. Why not?

The physicist cares about QM concepts to the extent that said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking “location is ill-defined, so I can safely jump out of the window”, then that would be changing the map, not the territory. It wouldn’t help them get the rewards of being healthy and getting to hang out with friends—even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist’s reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don’t need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won’t affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven’t worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of “location” or “space” is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI’s concept for “being in the box” should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it’s acting in terms of the classical model that has always given it the most reward.

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward… when the reward function is defined over the old model.

Next post in series: World-models as tools.