Humans Who Are Not Concentrating Are Not General Intelligences

Recently, OpenAI came out with a new language model that automatically synthesizes text, called GPT-2.

It’s disturbingly good. You can see some examples (cherry-picked, by their own admission) in OpenAI’s post and in the related technical paper.
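(If you want to skim samples of your own, here is a minimal sketch, assuming the publicly released small GPT-2 checkpoint and the Hugging Face transformers library; the prompt and sampling settings are illustrative, not the configuration OpenAI used for its cherry-picked examples.)

```python
# Minimal sampling sketch (illustrative addition, not part of the original post).
# Assumes the Hugging Face `transformers` library and the small public "gpt2"
# checkpoint; the prompt and sampling parameters are placeholders, not OpenAI's.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
result = generator(prompt, max_length=100, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```

Skimming a dozen such continuations back to back is a quick way to reproduce the effect described below.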

I’m not going to write about the machine learning here, but about the examples and what we can infer from them.

The scary thing about GPT-2-generated text is that it flows very naturally if you’re just skimming, reading for writing style and key, evocative words. The “unicorn” sample reads like a real science press release. The “theft of nuclear material” sample reads like a real news story. The “Miley Cyrus shoplifting” sample reads like a real post from a celebrity gossip site. The “GPT-2” sample reads like a real OpenAI press release. The “Legolas and Gimli” sample reads like a real fantasy novel. The “Civil War homework assignment” reads like a real C-student’s paper. The “JFK acceptance speech” reads like a real politician’s speech. The “recycling” sample reads like a real right-wing screed.

If I just skim, without focusing, they all look totally normal. I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.

But if I read with focus, I notice that they don’t make a lot of logical sense.

For instance, in the unicorn sample:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Wait a second, “Ovid” doesn’t refer to a “distinctive horn”, so why would naming them “Ovid’s Unicorn” be naming them after a distinctive horn? Also, you just said they had one horn, so why are you saying they have four horns in the next sentence?

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

Wait, unicorns originated from the interbreeding of humans and … unicorns? That’s circular, isn’t it?

Or, look at the GPT-2 sample:

We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text.

Except the second sentence isn’t a restatement of the first sentence — “task-specific training data” and “explicit rules for producing text” aren’t synonyms! So saying “That is” doesn’t make sense.

Or look at the LOTR sample:

Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.

Yeah, day doesn’t turn to dusk in the morning.

Or in the “resurrected JFK” sample:

(1) The brain of JFK was harvested and reconstructed via tissue sampling. There was no way that the tissue could be transported by air. (2) A sample was collected from the area around his upper chest and sent to the University of Maryland for analysis. A human brain at that point would be about one and a half cubic centimeters. The data were then analyzed along with material that was obtained from the original brain to produce a reconstruction; in layman’s terms, a “mesh” of brain tissue.

His brain tissue was harvested…from his chest?! A human brain is one and a half cubic centimeters?!

So, ok, this isn’t actually human-equivalent writing ability. OpenAI doesn’t claim it is, for what it’s worth — I’m not trying to diminish their accomplishment; that’s not the point of this post. The point is, if you skim text, you miss obvious absurdities. The point is, OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot.

The point is, I know of a few people, acquaintances of mine, who, even when asked to try to find flaws, could not detect anything weird or mistaken in the GPT-2-generated samples.

There are probably a lot of people who would be completely taken in by literal “fake news”, as in, computer-generated fake articles and blog posts. This is pretty alarming. Even more alarming: unless I make a conscious effort to read carefully, I would be one of them.

Robin Hanson’s post Better Babblers is very relevant here. He claims, and I don’t think he’s exaggerating, that a lot of human speech is simply generated by “low order correlations”, that is, generating sentences or paragraphs that are statistically likely to come after previous sentences or paragraphs:

After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from the answer you’d guess looking at low order correlations, most students usually give the wrong answer.

Simple correlations also seem sufficient to capture most polite conversation talk, such as the weather is nice, how is your mother’s illness, and damn that other political party. Simple correlations are also most of what I see in inspirational TED talks, and when public intellectuals and talk show guests pontificate on topics they really don’t understand, such as quantum mechanics, consciousness, postmodernism, or the need always for more regulation everywhere. After all, media entertainers don’t need to understand deep structures any better than do their audiences.

Let me call styles of talking (or music, etc.) that rely mostly on low order correlations “babbling”. Babbling isn’t meaningless, but to ignorant audiences it often appears to be based on a deeper understanding than is actually the case. When done well, babbling can be entertaining, comforting, titillating, or exciting. It just isn’t usually a good place to learn deep insight.
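To make “low order correlations” concrete, here is a toy sketch (purely illustrative; neither student babbling nor GPT-2 is literally this crude): a bigram “babbler” that learns only which words tend to follow which, and then chains statistically likely continuations together.

```python
# Toy "babbler" built from nothing but low order correlations: bigram counts
# over a tiny made-up corpus. Illustrative sketch only.
import random
from collections import defaultdict

def train_bigrams(text):
    """Record, for each word, the words observed to follow it (with repeats)."""
    words = text.split()
    followers = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return followers

def babble(followers, start, length=20):
    """Chain together likely next words, starting from `start`."""
    word, output = start, [start]
    for _ in range(length):
        options = followers.get(word)
        if not options:
            break
        # choosing from a list with repeats samples in proportion to observed counts
        word = random.choice(options)
        output.append(word)
    return " ".join(output)

corpus = (
    "the weather is nice today and the talk was inspiring "
    "and the weather was nice and the talk is nice today"
)
model = train_bigrams(corpus)
print(babble(model, "the"))
```

The output has the right local texture (words that plausibly go together) with nothing behind it, which is the sense in which GPT-2 is a vastly better babbler rather than a reasoner.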

I used to half-joke that the New Age Bullshit Generator was actually useful as a way to get myself to feel more optimistic. The truth is, it isn’t quite good enough to match the “aura” or “associations” of genuine, human-created inspirational text. GPT-2, though, is.

I also suspect that the “lyrical” or “free-associational” function of poetry is adequately matched by GPT-2. The autocompletions of Howl read a lot like Allen Ginsberg — they just don’t imply the same beliefs about the world. (Moloch whose heart is crying for justice! sounds rather positive.)

I’ve noticed that I cannot tell, from casual conversation, whether someone is intelligent in the IQ sense.

I’ve interviewed job applicants, and perceived them all as “bright and impressive”, but found that the vast majority of them could not solve a simple math problem. The ones who could solve the problem didn’t appear any “brighter” in conversation than the ones who couldn’t.

I’ve taught public school teachers, who were incredibly bad at formal mathematical reasoning (I know, because I graded their tests), to the point that I had not realized humans could be that bad at math — but it had no effect on how they came across in friendly conversation after hours. They didn’t seem “dopey” or “slow”; they were witty and engaging and warm.

I’ve read the personal blogs of intellectually disabled people — people who, by definition, score poorly on IQ tests — and they don’t read as any less funny or creative or relatable than anyone else.

Whatever ability IQ tests and math tests measure, I believe that lacking that ability doesn’t have any effect on one’s ability to make a good social impression or even to “seem smart” in conversation.

If “human intelligence” is about reasoning ability, the capacity to detect whether arguments make sense, then you simply do not need human intelligence to create a linguistic style or aesthetic that can fool our pattern-recognition apparatus if we don’t concentrate on parsing content.

I also noticed, upon reading GPT-2 samples, just how often my brain slides from focused attention to just skimming. I read the paper’s sample about Spanish history with interest, and the GPT-2-generated text was obviously absurd. My eyes glazed over during the sample about video games, since I don’t care about video games, and the machine-generated text looked totally unobjectionable to me. My brain is constantly making evaluations about what’s worth the trouble to focus on, and what’s ok to tune out. GPT-2 is actually really useful as a *test* of one’s level of attention.
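(Here is a rough sketch of what such an attention test could look like; this is my own construction for illustration, with placeholder snippets you would replace with real human-written and GPT-2-generated paragraphs.)

```python
# Hypothetical attention self-test (illustrative sketch, not from the post):
# shuffle human-written and machine-generated snippets, ask the reader to
# label each one, and report their accuracy. The snippets are placeholders.
import random

snippets = [
    ("<paste a human-written paragraph here>", "human"),
    ("<paste another human-written paragraph here>", "human"),
    ("<paste a GPT-2-generated paragraph here>", "machine"),
    ("<paste another GPT-2-generated paragraph here>", "machine"),
]

def run_quiz(items):
    """Present snippets in random order and score the reader's guesses."""
    items = list(items)
    random.shuffle(items)
    correct = 0
    for text, label in items:
        print("\n" + text)
        guess = input("human or machine? ").strip().lower()
        correct += (guess == label)
    print(f"\nYou got {correct} of {len(items)} right.")

run_quiz(snippets)
```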

This is related to my hypothesis in https://srconstantin.wordpress.com/2017/10/10/distinctions-in-types-of-thought/ that effortless pattern-recognition is what machine learning can do today, while effortful attention and explicit reasoning (which seems to be a subset of effortful attention) are generally beyond ML’s current capabilities.

Beta waves in the brain are usually associated with focused concentration or active or anxious thought, while alpha waves are associated with the relaxed state of being awake but with closed eyes, before falling asleep, or while dreaming. Alpha waves sharply decrease after a subject makes a mistake and begins paying closer attention. I’d be interested to see whether the ability to tell GPT-2-generated text from human-generated text correlates with alpha waves vs. beta waves.

The first-order effects of highly effective text-generators are scary. It will be incredibly easy and cheap to fool people, to manipulate social movements, etc. There’s a lot of opportunity for bad actors to take advantage of this.

The second-order effects might well be good, though. If only conscious, focused logical thought can detect a bot, maybe some people will become more aware of when they’re thinking actively vs. not, and will be able to flag when they’re not really focusing, and distinguish the impressions they absorb in a state of autopilot from “real learning”.

The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT-2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something), the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all).

Maybe practice and experience with GPT-2 will help people get better at doing “default to null”?