Concept Safety: Producing similar AI-human concept spaces

I’m currently reading through some relevant literature in preparation for my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of “Concept Safety”-titled articles.

A frequently raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via “rule-based motivational control” and programming it to follow restrictions like “stay within this lead box here”, but they raise worries about the difficulty of rigorously defining “this lead box here”. To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are, however, worried that this may fail, because

Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAI’s generalisations agree with ours.

Addressing this, a possibility that I raised in Sotala (2015) was that the concept-learning mechanisms in the human brain might actually be relatively simple, and that we could replicate the human concept learning process by replicating those rules. I’ll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later on in the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.

Word embedding

"Left panel shows vector offsets for three word pairs illustrating the gender relation. Right panel shows a different projection, and the singular/plural relation for two words. In high-dimensional space, multiple relations can be embedded for a single word." (Mikolov et al. 2013)A par­tic­u­larly fas­ci­nat­ing branch of re­cent re­search re­lates to the learn­ing of word em­bed­dings, which are map­pings of words to very high-di­men­sional vec­tors. It turns out that if you train a sys­tem on one of sev­eral kinds of tasks, such as be­ing able to clas­sify sen­tences as valid or in­valid, this builds up a space of word vec­tors that re­flects the re­la­tion­ships be­tween the words. For ex­am­ple, there seems to be a male/​fe­male di­men­sion to words, so that there’s a “fe­male vec­tor” that we can add to the word “man” to get “woman”—or, equiv­a­lently, which we can sub­tract from “woman” to get “man”. And it so hap­pens (Mikolov, Yih & Zweig 2013) that we can also get from the word “king” to the word “queen” by adding the same vec­tor to “king”. In gen­eral, we can (roughly) get to the male/​fe­male ver­sion of any word vec­tor by adding or sub­tract­ing this one differ­ence vec­tor!

Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence “the king sat on his throne” as valid while classifying the sentence “the king sat on her throne” as invalid. So including a gender dimension in the built-up representation makes sense.

But gender isn’t the only kind of relationship that gets reflected in the geometry of the word space. The same trick recovers many other relations, such as the one between a country and its capital city, or between the singular and plural forms of a word.

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we’re intuitively aware of become automatically learned and represented in the concept geometry. And as Olah (2014) comments:

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

It gets even more interesting, for we can use these embeddings for translation. Since Olah has already written an excellent exposition of this, I’ll just quote him:

We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, Wen and Wzh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
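
One simple way to implement the “forcing the points to line up” idea (a least-squares sketch in the spirit of Mikolov et al.’s translation-matrix work, not necessarily the exact joint-training setup Olah describes) is to fit a linear map from one embedding space to the other using the known translation pairs, and then use it to look up translations of words that were not in the dictionary:

```python
import numpy as np

# Placeholder embedding matrices; in practice these would be loaded from
# embeddings trained separately on English and Chinese text.
rng = np.random.default_rng(0)
dim = 50
W_en = rng.normal(size=(1000, dim))   # one row per English word
W_zh = rng.normal(size=(1000, dim))   # one row per Chinese word

# Suppose rows 0..499 of each matrix are known translation pairs.
known = np.arange(500)

# Fit M so that W_en[i] @ M is close to W_zh[i] for the known pairs.
M, *_ = np.linalg.lstsq(W_en[known], W_zh[known], rcond=None)

def translate(en_index):
    """Map an English word vector into the Chinese space and return the
    index of the nearest (cosine) Chinese word vector."""
    mapped = W_en[en_index] @ M
    sims = (W_zh @ mapped) / (np.linalg.norm(W_zh, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))

# Words outside the known dictionary (rows 500 and up) can now be
# "translated" by nearest-neighbour lookup in the mapped space.
print(translate(731))
```

With real embeddings, how well these induced translations work is one measure of how closely the two spaces’ shapes actually match.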

After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for “dog”. If you have images of Clippy, you put them near the word vector for “paperclip”. And so on.

You do that, and then you take some class of images the image classifier was never trained on, like images of cats. You ask it to place the cat image somewhere in the vector space. Where does it end up?

You guessed it: in the rough region of the “cat” words. Olah once more:

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

These algorithms made no attempt at being biologically realistic in any way. They didn’t try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.
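
A minimal sketch of the zero-shot setup just described: an image encoder maps images into the word-embedding space, and an image of an unseen class gets the label of the nearest word vector. The encoder and the word vectors below are random stand-ins, not the actual Frome et al. or Norouzi et al. models.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 300

# Stand-in word vectors; "cat" is never used as an image-training label.
word_vectors = {
    "dog": rng.normal(size=dim),
    "paperclip": rng.normal(size=dim),
    "cat": rng.normal(size=dim),
}

# Stand-in for a trained image model that outputs vectors in the
# word-embedding space (here just a fixed random projection of
# 2048-dimensional image features).
projection = rng.normal(size=(dim, 2048))

def image_embedding(image_features):
    return projection @ image_features

def classify(image_features):
    """Label an image by the nearest word vector (cosine similarity),
    even if that word was never an image-training label."""
    v = image_embedding(image_features)
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(word_vectors, key=lambda w: cos(word_vectors[w], v))

print(classify(rng.normal(size=2048)))
```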

How useful is this?

“Well, that’s cool”, you might now say. “But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they’re going to classify the world in the same way as humans do: they’re basically learning the human representation of the world. That doesn’t mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts.”

This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space, at least assuming that we chose the right learning algorithms.

When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there’s a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as “leaving the box” and which ones didn’t, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.
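
As a toy illustration of that “hidden structure in the feedback” point (everything here is invented: the box is just a unit cube and the learner a generic off-the-shelf classifier), labeling some positions as inside or outside the box is enough for a learner to extend the implied boundary to positions it never received feedback on:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Feedback: 500 positions labeled as inside/outside "the box" (a unit cube).
points = rng.uniform(-2, 2, size=(500, 3))
inside = np.all(np.abs(points) < 1.0, axis=1)

clf = SVC(kernel="rbf").fit(points, inside)   # learn the concept implied by the labels

# The learned concept generalizes to positions that got no feedback.
new_points = rng.uniform(-2, 2, size=(100, 3))
truth = np.all(np.abs(new_points) < 1.0, axis=1)
print("agreement on unseen positions:", np.mean(clf.predict(new_points) == truth))
```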

Comparing representations

“Hmm, maaaaaaaaybe”, is your skeptical answer. “But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it’s actually acquired a similar-enough representation and not something wildly off? And it’s one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that’s still a little hand-wavy. We don’t actually know that the human brain does anything remotely similar when it represents concepts.”

Here we turn, for a moment, to neuroscience.

From Kaplan, Man & Greening (2015): "In this example, subjects either see or touch two classes of objects, apples and bananas. (A) First, a classifier is trained on the labeled patterns of neural activity evoked by seeing the two objects. (B) Next, the same classifier is given unlabeled data from when the subject touches the same objects and makes a prediction. If the classifier, which was trained on data from vision, can correctly identify the patterns evoked by touch, then we conclude that the representation is modality invariant."

Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.

We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier, which has not been trained on the “touch” representations but only on the “sight” representations, manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. “the sight of a banana” has something in common with the neural representation for “the touch of a banana”.
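
Here is a minimal sketch of that cross-classification logic on simulated data (synthetic “voxel” patterns standing in for real fMRI scans): a classifier is trained on the vision condition and then scored on the touch condition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_voxels, n_trials = 200, 80

# Pretend each object category has an underlying pattern that is partly
# preserved across modalities, plus modality-specific noise.
apple = rng.normal(size=n_voxels)
banana = rng.normal(size=n_voxels)

def simulate(pattern, noise=1.5):
    return pattern + noise * rng.normal(size=(n_trials, n_voxels))

X_vision = np.vstack([simulate(apple), simulate(banana)])   # "sight" trials
X_touch  = np.vstack([simulate(apple), simulate(banana)])   # "touch" trials
y = np.array([0] * n_trials + [1] * n_trials)               # 0 = apple, 1 = banana

clf = LogisticRegression(max_iter=1000).fit(X_vision, y)    # train on sight only
print("cross-modal accuracy:", clf.score(X_touch, y))       # test on touch
```

Above-chance accuracy on the touch trials is the signal that the two representations share structure.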

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was trained on the image-evoked activity and made to predict the word-evoked activity, and vice versa, achieving high accuracy on category classification in both directions. Even more interestingly, the representations seemed to be similar between subjects: training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p < 0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI’s concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human’s and an AI’s internal representations of concepts. Take a human’s neural activation when they’re thinking of some concept, and then take the AI’s internal activation when it is thinking of the same concept, and plot them in a shared space, similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human’s neural activation of the word “cat” and use it to find the AI’s internal representation of the word “cat”? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
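
One concrete (and here purely hypothetical) way to run that test: collect paired representations of the same concepts from the human and from the AI, fit an alignment map such as an orthogonal rotation on part of the concepts, and check whether held-out concepts land nearest to their counterparts. The simulated data below just builds the “AI” space as a rotated, noisy copy of the “human” space so that the mechanics are visible:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(3)
n_concepts, dim = 100, 40

# Simulated "human" concept vectors; the "AI" space is a rotated, noisy copy.
human = rng.normal(size=(n_concepts, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
ai = human @ rotation + 0.1 * rng.normal(size=(n_concepts, dim))

train, test = np.arange(80), np.arange(80, 100)
R, _ = orthogonal_procrustes(human[train], ai[train])   # fit alignment on 80 concepts

# For each held-out concept, does the mapped human vector land nearest
# to the AI's vector for the *same* concept?
mapped = human[test] @ R
dists = np.linalg.norm(mapped[:, None, :] - ai[test][None, :, :], axis=-1)
matches = np.argmin(dists, axis=1) == np.arange(len(test))
print("held-out matching accuracy:", matches.mean())
```

High matching accuracy on held-out concepts would be evidence that the two concept geometries have genuinely similar shapes; with real data, both the representations and the choice of alignment model would be much messier.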

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human’s neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
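
A speculative sketch of what that extra constraint could look like during training (in PyTorch, with placeholder data, and with the strong simplifying assumption that the AI’s representation already lives in the same space the human-trained classifier was fitted on; in practice one would need some mapping between the two):

```python
import torch
import torch.nn as nn

dim_in, dim_rep, n_concepts = 64, 32, 10

encoder = nn.Sequential(nn.Linear(dim_in, dim_rep), nn.ReLU())  # the AI's representation
task_head = nn.Linear(dim_rep, n_concepts)                      # the AI's own prediction task

# A linear classifier assumed to have been fitted to human neural data
# and then frozen (random weights here stand in for those parameters).
human_clf = nn.Linear(dim_rep, n_concepts)
for p in human_clf.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(128, dim_in)               # placeholder inputs
y = torch.randint(0, n_concepts, (128,))   # placeholder concept labels

for _ in range(100):
    rep = encoder(x)
    task_loss = loss_fn(task_head(rep), y)   # the AI's own objective
    align_loss = loss_fn(human_clf(rep), y)  # human-trained classifier must also separate the concepts
    (task_loss + 0.5 * align_loss).backward()
    opt.step()
    opt.zero_grad()
```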

Next post in series: The problem of alien concepts.