Magical Categories

‘We can design intelligent machines so their primary, innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.’
-- Bill Hibbard (2001), Super-intelligent machines.

That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I’m discussing here.

So… um… what could possibly go wrong…

When I mentioned (sec. 6) that Hibbard’s AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:

‘When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.’

As Hibbard also wrote “Such obvious contradictory assumptions show Yudkowsky’s preference for drama over reason,” I’ll go ahead and mention that Hibbard illustrates a key point: There is no professional certification test you have to take before you are allowed to talk about AI morality. But that is not my primary topic today. Though it is a crucial point about the state of the gameboard that most AGI/FAI wannabes are so utterly unsuited to the task that I know of no one cynical enough to imagine the horror without seeing it firsthand. Even Michael Vassar was probably surprised his first time through.

No, today I am here to dissect “You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.”

Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest.

Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest”, but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive...”

But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed!

The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence: If the training problems and the real problems have the slightest difference in context—if they are not drawn from the same independent and identically distributed (i.i.d.) process—there is no statistical guarantee from past success to future success. It doesn’t matter if the AI seems to be working great under the training conditions. (This is not an unsolvable problem but it is an unpatchable problem. There are deep ways to address it—a topic beyond the scope of this post—but no bandaids.)
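To make the failure mode concrete, here is a minimal sketch in Python, assuming an invented toy dataset in which the “tank” label happens to coincide with cloudy lighting (none of the numbers or features come from the original story): the classifier fits the training set, would also pass a held-out split drawn from the same flawed process, and then fails as soon as the accidental correlation breaks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_photos(n, tank, cloudy):
    """Toy 16x16 grayscale 'photos': overall brightness tracks the weather,
    and a faint bright patch stands in for a camouflaged tank."""
    base = 0.3 if cloudy else 0.7
    imgs = rng.normal(base, 0.05, size=(n, 16, 16))
    if tank:
        imgs[:, 6:10, 6:10] += 0.05  # weak tank signal, easy for the model to ignore
    return imgs.reshape(n, -1)

# Flawed training data: every tank photo is cloudy, every forest photo sunny.
X_train = np.vstack([make_photos(50, tank=True, cloudy=True),
                     make_photos(50, tank=False, cloudy=False)])
y_train = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Deployment data breaks the correlation: tanks on sunny days, forest on cloudy days.
X_deploy = np.vstack([make_photos(50, tank=True, cloudy=False),
                      make_photos(50, tank=False, cloudy=True)])
y_deploy = np.array([1] * 50 + [0] * 50)

print("training accuracy:  ", clf.score(X_train, y_train))    # ~1.0
print("deployment accuracy:", clf.score(X_deploy, y_deploy))  # near 0.0: it learned "cloudy"
```

The particular model is beside the point; any learner rewarded only for fitting the training distribution is free to latch onto whichever regularity is easiest to find there.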

As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes. If a black-and-white image is 256 pixels on a side, then the total image is 65536 pixels. The number of possible images is 2^65536. And the number of possible concepts that classify images into positive and negative instances—the number of possible boundaries you could draw in the space of images—is 2^(2^65536). From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 2^65536 classified examples to discriminate among 2^(2^65536) possible concepts—even if classifications are constant over time.
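If you want to check the arithmetic, here is a small sketch that redoes the counting for a toy 4x4 binary image, since the 256x256 numbers are far too large to print directly:

```python
import math

# Toy version of the counting argument, using a 4x4 binary image.
# (The essay's 256x256 case gives 2^65536 images and 2^(2^65536) concepts;
# we only report sizes here.)
pixels = 4 * 4                           # number of attributes
num_images = 2 ** pixels                 # possible objects: 2^16 = 65,536
# possible concepts = ways to label every image + or - = 2^65,536;
# that number alone has roughly 19,729 decimal digits:
concept_digits = math.floor(num_images * math.log10(2)) + 1
print(f"{pixels} pixels -> {num_images:,} possible images "
      f"-> 2^{num_images:,} possible concepts (~{concept_digits:,} digits)")
# Without inductive bias, pinning down one concept among 2^65,536 candidates
# would require a label for essentially every one of the 65,536 images.
```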

If this seems at all counterintuitive or non-obvious, see Superexponential Conceptspace.

So let us now turn again to:

‘First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.’

and

‘When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.’

It’s trivial to discriminate between a photo of a camouflaged tank and a photo of an empty forest, in the sense of determining that the two photos are not identical. They’re different pixel arrays with different 1s and 0s in them. Discriminating between them is as simple as testing the arrays for equality.

Classifying new photos into positive and negative instances of “smile”, by reasoning from a set of training photos classified positive or negative, is a different order of problem.
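A minimal sketch of that distinction, with placeholder arrays standing in for the photos: discrimination is a bitwise equality test, while classification needs a concept that the pixel data alone does not supply.

```python
import numpy as np

# Discrimination: deciding whether two specific photos are the same photo.
# This is a bitwise comparison; no learning or concept is involved.
photo_a = np.zeros((256, 256), dtype=np.uint8)   # placeholder "camouflaged tank" photo
photo_b = np.ones((256, 256), dtype=np.uint8)    # placeholder "empty forest" photo
print(np.array_equal(photo_a, photo_b))          # False: the arrays differ

# Classification: assigning a label to a photo never seen during training.
# Nothing in the pixel array itself says which of the many label-consistent
# concepts should be used, so this function cannot be filled in from data alone.
def is_positive_instance(photo: np.ndarray) -> bool:
    raise NotImplementedError("depends on the chosen concept, i.e. the inductive bias")
```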

When you’ve got a 256x256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65537th bit denoting the positiveness—no tiny little XML tag that says “This image is inherently positive”. It’s only a positive example relative to some particular concept.

But for any non-Vast amount of training data—any training data that does not include the exact bitwise image now seen—there are superexponentially many possible concepts compatible with previous classifications.

For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias. Which may not match what the user has in mind. The gap between these two example-classifying processes—induction on the one hand, and the user’s actual goals on the other—is not trivial to cross.

Let’s say the AI’s training data is:

Dataset 1:

  • +
    • Smile_1, Smile_2, Smile_3
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5

Now the AI grows up into a superintelligence, and encounters this data:

Dataset 2:

    • Frown_6, Cat_3, Smile_4, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

It is not a property of these datasets that the inferred classification you would prefer is:

  • +
    • Smile_1, Smile_2, Smile_3, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

rather than

  • +
    • Smile_1, Smile_2, Smile_3, Molecular_Smileyface_1, Molecular_Smileyface_2, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Cat_4, Galaxy_2, Nanofactory_2

Both of these classifications are compatible with the training data. The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset. If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.
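As a toy illustration of that compatibility (the instance names echo the lists above, but the two-bit feature encoding and both hand-written concepts are invented for this sketch), here are two classifiers that reproduce every training label and still disagree about the molecular smileyface:

```python
# Each instance is encoded by two made-up boolean features:
# (has_smile_geometry, is_human_scale_face)
instances = {
    "Smile_1": (True, True), "Smile_2": (True, True), "Smile_3": (True, True),
    "Frown_1": (False, True), "Cat_1": (False, False), "Boat_1": (False, False),
    "Molecular_Smileyface_1": (True, False),   # never appears in the training data
}

training_labels = {"Smile_1": True, "Smile_2": True, "Smile_3": True,
                   "Frown_1": False, "Cat_1": False, "Boat_1": False}

def concept_a(x):
    """'Anything with smile geometry' -- the classification nobody asked for."""
    has_smile_geometry, _ = x
    return has_smile_geometry

def concept_b(x):
    """'Smile geometry on an actual human-scale face' -- the intended classification."""
    has_smile_geometry, is_human_scale_face = x
    return has_smile_geometry and is_human_scale_face

# Both concepts reproduce every label in the training data...
assert all(concept_a(instances[name]) == label for name, label in training_labels.items())
assert all(concept_b(instances[name]) == label for name, label in training_labels.items())

# ...and disagree on the borderline case the training data never constrained.
new_case = instances["Molecular_Smileyface_1"]
print("concept_a says:", concept_a(new_case))   # True
print("concept_b says:", concept_b(new_case))   # False
```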

Which classification will the AI choose? This is not an inherent property of the training data; it is a property of how the AI performs induction.

Which is the correct classification? This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name “right”).

The concept that you wanted cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so. That’s what supervised learning is all about—providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.

But unless the training data is drawn from exactly the same context as the real-life cases, the training data will be “shallow” in some sense, a projection from a much higher-dimensional space of possibilities.

The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, nor a tiny little agent with a happiness counter set to a googolplex. Now you, finally presented with a tiny molecular smiley—or perhaps a very realistic tiny sculpture of a human face—know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values. It is your own plans and desires that are at work when you say “No!”

Hibbard knows instinctively that a tiny molecular smileyface isn’t a “smile”, because he knows that’s not what he wants his putative AI to do. If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling—as opposed to frowning, say—even though it’s only paint.

As the case of Terry Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas. Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece will not enable the AI to make a moral decision as to whether switching off Terry’s life support is murder. That information isn’t present in the dataset even inductively! Terry Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn’t need to think about while classifying photos of living and dead humans from the time of Ancient Greece. No one was on life support then, still breathing with a brain half fluid. So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.

As a matter of formal fallacy, I see two anthropomorphic errors on display.

The first fallacy is underestimating the complexity of a concept we develop for the sake of its value. The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven’t seen before. But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is obviously not a smile. And we don’t generate all possible borderline cases, so we don’t think of all the considerations that might play a role in redefining the concept, but haven’t yet played a role in defining it. Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data. (And also the difficulty of describing the concept directly—see The Hidden Complexity of Wishes.)

The second fallacy is anthropomorphic optimism: Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of “smile”. As Hibbard uses the “smile” concept (to describe desired behavior of superintelligences), extending “smile” to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do—inherently so, as a property of the concept itself—so surely a superintelligence would not do it; this is just obviously the wrong classification. Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.

Why, Friendly AI isn’t hard at all! All you need is an AI that does what’s good! Oh, sure, not every possible mind does what’s good—but in this case, we just program the superintelligence to do what’s good. All you need is a neural network that sees a few instances of good things and not-good things, and you’ve got a classifier. Hook that up to an expected utility maximizer and you’re done!

I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate “winning” sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood. Relative to the full space of possibilities the Future encompasses, we ourselves haven’t imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out. To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.

Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a “smile”, and excludes every facial expression that Hibbard wouldn’t label a “smile”...

Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

The deep answers to such problems are beyond the scope of this post, but it is a general principle of Friendly AI that there are no bandaids. In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors. Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying “Yes, that’s happiness!” about hydrogen atoms—hydrogen atoms are easy to make.

Link to my discussion with Hibbard here. You already got the important parts.