Concept Safety: Producing similar AI-human concept spaces

I’m currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of “Concept Safety”-titled articles.

A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via “rule-based motivational control” and programming it to follow restrictions like “stay within this lead box here”, but they raise worries about the difficulty of rigorously defining “this lead box here”. To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are however worried that this may fail, because

Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAI’s generalisations agree with ours.

Addressing this, a possibility that I raised in Sotala (2015) was that the concept-learning mechanisms in the human brain might actually be relatively simple, and that we could replicate the human concept learning process by replicating those rules. I’ll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later on in the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.

Word embedding

A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there’s a “female vector” that we can add to the word “man” to get “woman”—or, equivalently, which we can subtract from “woman” to get “man”. And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word “king” to the word “queen” by adding the same vector to “king”. In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!
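If you want to see this yourself, the effect is easy to reproduce with pretrained vectors. Here is a minimal sketch using gensim’s downloadable GloVe vectors, which stand in for the models in the papers above; the exact neighbors you get will depend on which vectors you load:

```python
# Analogy arithmetic on pretrained word vectors. GloVe vectors via gensim's
# downloader stand in for the models discussed in the text; the first call
# downloads a small pretrained model.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same kind of shift tends to work for other gendered pairs, e.g. uncle -> aunt
print(model.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))
```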

Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence “the king sat on his throne” as valid while classifying the sentence “the king sat on her throne” as invalid. So including a gender dimension in the built-up representation makes sense.

But gender isn’t the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more: similar difference vectors show up for relationships such as singular/plural, comparative/superlative, and country/capital, so that e.g. “Paris” - “France” + “Italy” lands near “Rome”.

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we’re intuitively aware of become automatically learned and represented in the concept geometry. And as Olah (2014) comments:

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

It gets even more interesting, for we can use these embeddings for translation. Since Olah has already written an excellent exposition of this, I’ll just quote him:

We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, W_en and W_zh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
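The joint training Olah describes optimizes both embeddings at once, but the “similar shape” intuition can be illustrated with an even simpler post-hoc setup, along the lines of Mikolov, Le & Sutskever’s translation-matrix work: learn a linear map between two separately trained spaces from a small seed dictionary, and check whether pairs outside the dictionary line up too. A toy sketch with synthetic stand-ins for the two languages:

```python
# Toy sketch: the two "languages" are synthetic, one space being a rotated,
# slightly noisy copy of the other. We fit a linear map on a small seed
# dictionary and check that held-out pairs also end up lining up.
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 50, 1000
en = rng.normal(size=(n_words, dim))                          # "English" vectors
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
zh = en @ rotation + 0.01 * rng.normal(size=(n_words, dim))   # "Chinese" vectors

seed, heldout = slice(0, 100), slice(100, 200)                # known vs. unknown pairs
W, *_ = np.linalg.lstsq(en[seed], zh[seed], rcond=None)       # linear map en -> zh

# For held-out words: is the mapped "English" vector nearest to its true translation?
mapped = en[heldout] @ W
dists = np.linalg.norm(mapped[:, None, :] - zh[None, :, :], axis=2)
accuracy = np.mean(np.argmin(dists, axis=1) == np.arange(100, 200))
print(f"held-out translation accuracy: {accuracy:.2f}")
```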

After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for “dog”. If you have images of Clippy, you put them near the word vector for “paperclip”. And so on.

You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up?

You guessed it: in the rough region of the “cat” words. Olah once more:

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.
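To make the pipeline concrete, here is a toy sketch of the zero-shot idea (in the spirit of the systems Olah describes, not their actual code): learn a map from image features into the word-embedding space using only the known classes, then classify images of unseen classes by the nearest word vector. Synthetic data stands in for a real image model and real word embeddings, with the toy dimensions chosen so that the known classes cover the embedding space:

```python
# Toy zero-shot classification: map image features into the word embedding
# space using known classes only, then place images of unseen classes by
# nearest word vector. Real image features are not a linear function of word
# vectors; the linear "mixing" below only imitates the shape of the pipeline.
import numpy as np

rng = np.random.default_rng(1)
emb_dim, feat_dim, imgs_per_class = 6, 64, 50
classes = ["dog", "horse", "car", "truck", "apple", "banana", "chair", "table",
           "cat", "lion"]                               # the last two stay unseen
class_vecs = rng.normal(size=(len(classes), emb_dim))   # stand-in word vectors
mixing = rng.normal(size=(emb_dim, feat_dim))           # stand-in "image model"

def fake_image_features(c, n):
    """Pretend image features that carry information about class c."""
    return class_vecs[c] @ mixing + 0.1 * rng.normal(size=(n, feat_dim))

known = range(8)                                        # train on 8 classes only
X = np.vstack([fake_image_features(c, imgs_per_class) for c in known])
Y = np.vstack([np.tile(class_vecs[c], (imgs_per_class, 1)) for c in known])
M, *_ = np.linalg.lstsq(X, Y, rcond=None)               # features -> embedding space

for c in (8, 9):                                        # "cat" and "lion": never trained on
    pred = fake_image_features(c, 1) @ M                # project the image into word space
    nearest = np.argmin(np.linalg.norm(class_vecs - pred, axis=1))
    print(f"unseen image of a {classes[c]}: lands nearest to '{classes[nearest]}'")
```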

These algorithms made no attempt to be biologically realistic in any way. They didn’t try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.

How useful is this?

“Well, that’s cool”, you might now say. “But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they’re going to classify the world in the same way as humans do: they’re basically learning the human representation of the world. That doesn’t mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts.”

This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space, at least assuming that we chose the right learning algorithms.

When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there’s a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as “leaving the box” and which ones didn’t, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.
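As a deliberately trivial illustration of that last point, suppose the “world” were just a two-dimensional state space and the box an axis-aligned region in it (a hypothetical setup, invented purely for illustration). A standard classifier will extract the boundary implied by the operator’s feedback and apply it to states it never got feedback on:

```python
# Tiny sketch of concept-from-feedback: the operator labels some states as
# inside or outside the box, and a classifier recovers the implied boundary
# and generalizes it to states that were never labeled.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
states = rng.uniform(-2, 2, size=(500, 2))          # candidate (x, y) positions
inside_box = (np.abs(states) <= 1).all(axis=1)      # the operator's actual concept

labeled = slice(0, 100)                             # states the operator gave feedback on
clf = DecisionTreeClassifier().fit(states[labeled], inside_box[labeled])

unlabeled = slice(100, 500)
acc = clf.score(states[unlabeled], inside_box[unlabeled])
print(f"agreement with the operator's concept on new states: {acc:.2f}")
```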

Comparing representations

“Hmm, maaaaaaaaybe” is your skeptical answer. “But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it’s actually acquired a similar-enough representation and not something wildly off? And it’s one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that’s still a little hand-wavy. We don’t actually know that the human brain uses anything remotely similar to represent concepts.”

Here we turn, for a moment, to neuroscience.

Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representations of a banana have something in common.

We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier—which has not been trained on the “touch” representations, only on the “sight” representations—manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. “the sight of a banana” has something in common with the neural representation for “the touch of a banana”.
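The logic of the method is simple enough to sketch in a few lines. This mimics only the cross-classification step, not any real fMRI preprocessing, and all the “voxel” data below is simulated:

```python
# Cross-classification sketch: train a decoder on simulated "sight" trials and
# test it on simulated "touch" trials. The shared stimulus-driven pattern is
# what allows the decoder to transfer across modalities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_trials, n_voxels = 80, 200
stimulus = rng.integers(0, 2, size=n_trials)        # 0 = apple, 1 = banana
shared_pattern = rng.normal(size=n_voxels)          # stimulus-driven activity

def simulate_trials(modality_baseline):
    """Voxel activity = shared stimulus pattern + modality-specific baseline + noise."""
    return (np.outer(stimulus, shared_pattern)
            + modality_baseline
            + rng.normal(size=(n_trials, n_voxels)))

sight = simulate_trials(rng.normal(size=n_voxels))
touch = simulate_trials(rng.normal(size=n_voxels))

decoder = LogisticRegression(max_iter=1000).fit(sight, stimulus)
print(f"cross-modal decoding accuracy: {decoder.score(touch, stimulus):.2f} (chance = 0.5)")
```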

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was trained on the image-evoked activity and used to predict the word-evoked activity, and vice versa, achieving high accuracy on category classification in both directions. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representations of the left-out participant, also achieved reliable (p < 0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI’s concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human’s and an AI’s internal representations of concepts. Take a human’s neural activation when they’re thinking of some concept, and then take the AI’s internal activation when it is thinking of the same concept, and plot them in a shared space, similar to the English-Mandarin translation above. To what extent do the two concept geometries have similar shapes, allowing one to use a human’s neural activation for the word “cat” to find the AI’s internal representation of the word “cat”? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
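Before attempting a full translation between the two spaces, one could get a coarse measure of whether the concept geometries have similar shapes by comparing their pairwise-distance structure directly, in the style of representational similarity analysis from neuroscience. A sketch with synthetic data; real inputs would be e.g. fMRI activation patterns on the human side and network activations on the AI side:

```python
# Compare two concept geometries by correlating their pairwise concept
# distances (an RSA-style comparison). This needs only paired concepts,
# not matching dimensionalities.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n_concepts = 100
human = rng.normal(size=(n_concepts, 40))                    # "human" concept vectors

# An AI space with a similar shape: a projected, noisier, differently-sized copy.
projection = rng.normal(size=(40, 64))
similar_ai = human @ projection + 0.5 * rng.normal(size=(n_concepts, 64))
unrelated_ai = rng.normal(size=(n_concepts, 64))             # no shared structure

def shape_similarity(a, b):
    """Spearman correlation between the two spaces' pairwise concept distances."""
    rho, _ = spearmanr(pdist(a), pdist(b))
    return rho

print(f"similar geometry:   {shape_similarity(human, similar_ai):.2f}")
print(f"unrelated geometry: {shape_similarity(human, unrelated_ai):.2f}")
```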

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human’s neural representations will correctly identify concept-clusters within the AI. This might force internal similarities in the representation beyond the ones that would already be formed from similarities in the data.
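One way to make that constraint measurable is as a score which the AI’s training procedure would then try to keep high: fit a classifier on the human’s neural representations of some concept categories, and check how well it identifies the same categories in the AI’s internal representations. This assumes the two representations have already been brought into a shared space, e.g. via an alignment like the ones above; the sketch below is hypothetical and uses synthetic data:

```python
# Cross-system concept score: how well does a classifier trained on (simulated)
# human concept representations identify concept categories in the AI's
# (simulated) representations? Training would try to keep this score high;
# here it is only evaluated, not optimized.
import numpy as np
from sklearn.linear_model import LogisticRegression

def human_alignment_score(human_reps, ai_reps, concept_labels):
    """Accuracy of a human-trained concept classifier on the AI's representations."""
    clf = LogisticRegression(max_iter=1000).fit(human_reps, concept_labels)
    return clf.score(ai_reps, concept_labels)

# Toy check: an AI representation that mirrors the human concept clusters scores
# high; an unrelated representation scores near chance (0.25 for four categories).
rng = np.random.default_rng(5)
labels = np.repeat(np.arange(4), 50)                     # four concept categories
human = rng.normal(size=(200, 30)) + labels[:, None]     # category-dependent pattern
similar_ai = human + 0.3 * rng.normal(size=(200, 30))    # similar cluster structure
unrelated_ai = rng.normal(size=(200, 30))

print(f"similar AI:   {human_alignment_score(human, similar_ai, labels):.2f}")
print(f"unrelated AI: {human_alignment_score(human, unrelated_ai, labels):.2f}")
```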

Next post in series: The problem of alien concepts.