human psycholinguists: a critical appraisal

(The title of this post is a joking homage to one of Gary Marcus’ papers.)

I’ve discussed GPT-2 and BERT and other instances of the Transformer architecture a lot on this blog. As you can probably tell, I find them very interesting and exciting. But not everyone has the reaction I do, including some people who I think ought to have that reaction.

Whatever else GPT-2 and friends may or may not be, I think they are clearly a source of fascinating and novel scientific evidence about language and the mind. That much, I think, should be uncontroversial. But it isn’t.

(i.)

When I was a teenager, I went through a period where I was very interested in cognitive psychology and psycholinguistics. I first got interested via Steven Pinker’s popular books – this was back when Pinker was mostly famous for writing about psychology rather than history and culture – and proceeded to read other, more academic books by authors like Gary Marcus, Jerry Fodor, and John Anderson.

At this time (roughly 2002-6), there was nothing out there that remotely resembled GPT-2. Although there were apparently quite mature and complete formal theories of morphology and syntax, which could accurately answer questions like “is this a well-formed English sentence?”, no one really knew how these could or should be implemented in a physical system meant to understand or produce language.

This was true in two ways. For one thing, no one knew how the human brain implemented this stuff, although apparently it did. But the difficulty was more severe than that: even if you forgot about the brain, and just tried to write a computer program (any computer program) that understood or produced language, the results would be dismal.

At the time, such programs were either specialized academic models of one specific phenomenon – for example, a program that could form the past tense of a verb, but couldn’t do anything else – or they were ostensibly general-purpose but incredibly brittle and error-prone, little more than amusing toys. The latter category included some programs intended as mere amusements or provocations, like the various chatterbots (still about as good/bad as ELIZA after four decades), but also more serious efforts whose reach exceeded their grasp. SYSTRAN spent decades manually curating millions of morphosyntactic and semantic facts for enterprise-grade machine translation; you may remember the results in the form of the good old Babel Fish website, infamous for its hilariously inept translations.

This was all kind of surprising, given that the mature formal theories were right there, ready to be programmed into rule-following machines. What was going on?

The impression I came away with, reading about this stuff as a teenager, was of language as a fascinating and daunting enigma, simultaneously rule-based and rife with endless special cases that stacked upon one another. It was formalism, Jim, but not as we knew it; it was a magic interleaving of regular and irregular phenomena, arising out of the distinctive computational properties of some not-yet-understood subset of brain architecture, which the models of academics and hackers could crudely imitate but not really grok. We did not have the right “language” to talk about language the way our own brains did, internally.

(ii.)

The books I read, back then, talked a lot about this thing called “connectionism.”

This used to be a big academic debate, with people arguing for and against “connectionism.” You don’t hear that term much these days, because the debate has been replaced by a superficially similar but actually very different debate over “deep learning,” in which what used to be good arguments about “connectionism” are repeated in cruder form as bad arguments about “deep learning.”

But I’m getting ahead of myself. What was the old debate about?

As you may know, the pioneers of deep learning had been pioneering it for many years before it went mainstream. What we now call “neural nets” were invented step by step a very long time ago, and very early and primitive neural nets were promoted with far too much zeal as long ago as the 60s.

First there was the “Perceptron,” a single-layer fully-connected network with an update rule that didn’t scale to more layers. It generated a lot of unjustified hype, and was then “refuted” in inimitable petty-academic fashion by Minsky and Papert’s book Perceptrons, a mathematically over-elaborate expression of the simple and obvious fact that no single-layer net can express XOR. (Because no linear classifier can! Duh!)

Then the neural net people came back, armed with “hidden layers” (read: “more than one layer”) trained by “backpropagation” (read: “efficient gradient descent”). These had much greater expressive power, and amounted to a form of nonlinear regression which could learn fairly arbitrary function classes from data.
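To make the contrast concrete, here is a minimal sketch (my illustration, not anything from the period literature) of both claims: a single-layer perceptron is just a linear classifier and cannot represent XOR, while a network with one hidden layer, trained by plain gradient descent, fits it easily.

```python
import numpy as np

# XOR: the four input/output pairs no linear decision boundary can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# A single-layer perceptron computes sign(w.x + b): one line through the plane.
# No choice of w, b gets all four XOR cases right, so we go straight to the
# "hidden layer + backpropagation" fix: a tiny 2-4-1 net, plain gradient descent.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)          # hidden layer
    p = sigmoid(h @ W2 + b2)          # output probability
    g = p - y                         # gradient of cross-entropy loss at the output
    gW2, gb2 = h.T @ g, g.sum(axis=0, keepdims=True)
    gh = (g @ W2.T) * (1 - h ** 2)    # backpropagate through the tanh layer
    gW1, gb1 = X.T @ gh, gh.sum(axis=0, keepdims=True)
    for param, grad in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        param -= 0.5 * grad           # gradient descent step

print((p > 0.5).astype(int).ravel())  # typically [0 1 1 0]; may depend on the random init
```

Nothing here is specific to XOR; the hidden layer is what buys the “fairly arbitrary function classes.”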

Some people in psychology became interested in using them as a model for human learning. AFAIK this was simply because nonlinear regression kind of looks like learning (it is now called “machine learning”), and because of the very loose but much-discussed resemblance between these models and the layered architecture of real cortical neurons. The use of neural nets as modeling tools in psychology became known as “connectionism.”

Why was there a debate over connectionism? To opine: because the neural nets of the time (80s to early 90s) really sucked. Weight sharing architectures like CNN and LSTM hadn’t been invented yet; everything was either a fully-connected net or a custom architecture suspiciously jerry-rigged to make the right choices on some specialized task. And these things were being used to model highly regular, rule-governed phenomena, like verb inflection – cases where, even when human children make some initial mistakes, those mistakes themselves have a regular structure.

The connectionist models typically failed to reproduce this structure; where human kids typically err by applying a generic rule to an exceptional case (“I made you a cookie, but I eated it” – a cute meme because an authentically childlike one), the models would err by producing inhuman “blends,” recognizing the exception yet applying the rule anyway (“I ated it”).

There were already good models of correct verb inflection, and generally of correct versions of all these behaviors. Namely, the formal rule systems I referred to earlier. What these systems lacked (by themselves) was a model of learning, of rule-system acquisition. The connectionist models purported to provide this – but they didn’t work.

(iii.)

In 2001, a former grad student of Pinker’s named Gary Marcus wrote an interesting book called The Algebraic Mind: Integrating Connectionism and Cognitive Science. As a teenager, I read it with enthusiasm.

Here is a gloss of Marcus’ position as of this book. Quote-formatted to separate it from the main text, but it’s my writing, not a quote:

The best existing models of many psychological phenomena are formal symbolic ones. They look like math or like computer programs. For instance, they involve general rules containing variables, little “X”s that stand in identically for every single member of some broad domain. (Regular verb inflection takes any X and tacks “-ed” on the end. As Marcus observes, we can do this on the fly with novel words, as when someone talks of a politician who has “out-Gorbacheved Gorbachev.”)

The connectionism debate has conflated at least two questions: “does the brain implement formal symbol-manipulation?” and “does the brain work something like a ‘neural net’ model?” The assumption has been that neural nets don’t manipulate symbols, so if one answer is “yes” the other must be “no.” But the assumption is false: some neural nets really do implement (approximate) symbol manipulation.

This includes some, but not all, of the popular “connectionist” models, despite the fact that any “connectionist” success tends to be viewed as a strike against symbol manipulation. Moreover (Marcus argues), the connectionist nets that succeed as psychological models are the ones that implement symbol manipulation. So the evidence is actually convergent: the best models manipulate symbols, including the best neural net models.

Assuming the brain does do symbol manipulation, as the evidence suggests, what remains to be answered is how it does it. Formal rules are natural to represent in a centralized architecture like a Turing machine; how might they be encoded in a distributed architecture like a brain? And how might these complex mechanisms be reliably built, given only the limited information content of the genome?

To answer these questions, we’ll need models that look sort of like neural nets, in that they use massively parallel arrays of small units with limited central control, and build themselves to do computations no one has explicitly “written out.”

But, to do the job, these models can’t be the dumb generic putty of a fully-connected neural net trained with gradient descent. (Marcus correctly observes that those models can’t generalize across unseen input and output nodes, and thus require innate knowledge to be sneakily baked into the input/output representations.) They need special pre-built wiring of some sort, and the proper task of neural net models in psychology is to say what this wiring might look like. (Marcus proposes, e.g., an architecture called “treelets” for recursive representations. Remember this was before the popular adoption of CNNs, LSTMs, etc., so this was as much a point presaging modern deep learning as a point against modern deep learning; indeed I can find no sensible way to read it as the latter at all.)

Now, this was all very sensible and interesting, back in the early 2000s. It still is. I agree with it.

What has happened since the early 2000s? Among other things: an explosion of new neural net architectures with more innate structure than the old “connectionist” models. CNNs, LSTMs, recursive networks, memory networks, pointer networks, attention, transformers. Basically all of these advances were made to solve the sorts of problems Marcus was interested in, back in 2001 – to wire up networks so they could natively encode the right kinds of abstractions for human-like generalization, before they saw any data at all. And they’ve been immensely successful!

What’s more, the successes have patterns. The success of GPT-2 and BERT was not a matter of plugging more and more data into fundamentally dumb putty. (I mean, it involved huge amounts of data, but so does human childhood.) The transformer architecture was a real representational advance: suddenly, by switching from one sort of wiring to another sort of wiring, the wired-up machines did way better at language.

Perhaps – as the Gary Marcus of 2001 said – when we look at which neural net architectures succeed in imitating human behavior, we can learn something about how the human brain actually works.

Back in 2001, when neural nets struggled to model even simple linguistic phenomena in isolation, Marcus surveyed 21 (!) such networks intended as models of the English past tense. Here is part of his concluding discussion:

The past tense question originally became popular in 1986 when Rumelhart and McClelland (1986a) asked whether we really have mental rules. Unfortunately, as the proper account of the past tense has become increasingly discussed, Rumelhart and McClelland’s straightforward question has become twice corrupted. Their original question was “Does the mind have rules in anything more than a descriptive sense?” From there, the question shifted to the less insightful “Are there two processes or one?” and finally to the very uninformative “Can we build a connectionist model of the past tense?” The “two processes or one?” question is less insightful because the nature of processes—not the sheer number of processes—is important. […] The sheer number tells us little, and it distracts attention from Rumelhart and McClelland’s original question of whether (algebraic) rules are implicated in cognition.

The “Can we build a connectionist model of the past tense?” question is even worse, for it entirely ignores the underlying question about the status of mental rules. The implicit premise is something like “If we can build an empirically adequate connectionist model of the past tense, we won’t need rules.” But as we have seen, this premise is false: many connectionist models implement rules, sometimes inadvertently. […]

The right question is not “Can any connectionist model capture the facts of inflection?” but rather “What design features must a connectionist model that captures the facts of inflection incorporate?” If we take what the models are telling us seriously, what we see is that those connectionist models that come close to implementing the rule-and-memory model far outperform their more radical cousins. For now, as summarized in table 3.4, it appears that the closer the past tense models come to recapitulating the architecture of the symbolic models – by incorporating the capacity to instantiate variables with instances and to manipulate (here, “copy” and “suffix”) the instances of those variables – the better they perform.

Connectionist models can tell us a great deal about cognitive architecture but only if we carefully examine the differences between models. It is not enough to say that some connectionist model will be able to handle the task. Instead, we must ask what architectural properties are required. What we have seen is that models that include machinery for operations over variables succeed and that models that attempt to make do without such machinery do not.
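For concreteness, here is a minimal sketch (my illustration, not Marcus’s code) of the “rule-and-memory” model the quote refers to: memorized exceptions are consulted first, and otherwise a default rule copies the stem, the variable X, and suffixes “-ed”, which is exactly what lets it inflect a novel form like “out-Gorbacheved” on the fly.

```python
# A minimal sketch (my illustration, not Marcus's code) of a rule-and-memory
# past-tense model: stored exceptions win; otherwise a default rule applies to
# any stem X, which is why brand-new stems are inflected on the fly.

IRREGULARS = {"eat": "ate", "break": "broke", "go": "went", "sing": "sang"}

def past_tense(stem: str) -> str:
    if stem in IRREGULARS:        # memory: look up the stored exception
        return IRREGULARS[stem]
    suffix = "d" if stem.endswith("e") else "ed"
    return stem + suffix          # rule: copy the variable X, add the suffix

print(past_tense("walk"))            # walked
print(past_tense("eat"))             # ate
print(past_tense("out-gorbachev"))   # out-gorbacheved -- a novel stem is no problem
```

Note what the characteristic error looks like here: if “eat” were missing from the memory table, the output would be the child-like over-regularization “eated”, never the blend “ated”.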

Now, okay, there is no direct comparison between these models and GPT-2 / BERT. For these models were meant as fine-grained accounts of one specific phenomenon, and what mattered most was how they handled edge cases, even which errors they made when they did err.

By contrast, the popular transformer models are primarily impressive as models of typical-case competence: they sure look like they are following the rules in many realistic cases, but it is less clear whether their edge behavior and their generalizations to very uncommon situations extend the rules in the characteristic ways we do.

And yet. And yet …

(iv.)

In 2001, in the era of my teenage psycho-cognitive-linguistics phase, computers couldn’t do syntax, much less semantics, much less style, tone, social nuance, dialect. Immense effort was poured into simulating comparatively trivial cases like the English past tense in isolation, or making massive brittle systems like Babel Fish, thousands of hours of expert curation leading up to gibberish that gave me a good laugh in 5th grade.

GPT-2 does syntax. I mean, it really does it. It is competent.

A conventionally trained psycholinguist might quibble, asking things like “does it pass the wug test?” I’ve tried it, and the results are … kind of equivocal. So maybe GPT-2 doesn’t respond to probes of edge case behavior the way human children do.
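(For the curious, here is roughly what such a probe looks like: a sketch using the HuggingFace transformers library rather than my exact setup, with the prompt and sampling settings as arbitrary placeholder choices.)

```python
# A rough sketch of a wug-style probe against the base GPT-2 model, using the
# HuggingFace transformers library (not my exact setup; the prompt and sampling
# settings are arbitrary). The question: does the model inflect a nonce word?
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "This is a wug. Now there are two of them. There are two"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=True,
    top_k=40,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    # print only the continuation, i.e. what comes after the prompt
    print(tokenizer.decode(seq[inputs["input_ids"].shape[1]:]))
```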

But if so, then so much the worse for the wug test. Or rather: if so, we have learned something about which kinds of linguistic competence are possible in isolation, without some others.

What does GPT-2 do? It fucking writes. Short pithy sentences, long flowing beautiful sentences, everything in between – and almost always well-formed, nouns and verbs agreeing, irregulars correctly inflected, big compositional stacks of clauses lining up just the way they’re supposed to. Gary Marcus was right: you can’t do this with a vanilla fully-connected net, or even with one of many more sophisticated architectures. You need the right architecture. You need, maybe, just maybe, an architecture that can tell us a thing or two about the human brain.

GPT-2 fucking writes. Syntax, yes, and style: it knows the way sentences bob and weave, the special rhythms of many kinds of good prose and of many kinds of distinctively bad prose. Idioms, colloquialisms, self-consistent little worlds of language.

I think maybe the full effect is muted by those services people use that just let you type a prompt and get a continuation back from the base GPT-2 model; with those you’re asking a question that is fundamentally ill-posed (“what is the correct way to finish this paragraph?” – there isn’t one, of course). What’s more impressive to me is fine-tuning on specific texts in conjunction with unconditional generation, pushing the model in the direction of a specific kind of writing and then letting the model work freestyle.
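(The general workflow, for anyone who wants to try it: fine-tune on a corpus, then generate unconditionally by starting from the end-of-text token. Below is a sketch using the HuggingFace transformers tooling, not the exact scripts I used; the corpus path, output directory, and hyperparameters are placeholders.)

```python
# A sketch of fine-tune-then-generate-unconditionally with HuggingFace
# transformers (not my exact tooling; "corpus.txt", the output directory, and
# the hyperparameters are placeholders).
import torch
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Fine-tune as a plain language model on the target corpus.
dataset = TextDataset(tokenizer=tokenizer, file_path="corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="finetuned-gpt2", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()

# Unconditional generation: prompt with nothing but the <|endoftext|> token
# and let the fine-tuned model work freestyle.
start = torch.tensor([[tokenizer.eos_token_id]], device=model.device)
sample = model.generate(start, max_new_tokens=200, do_sample=True, top_p=0.9,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```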

One day I fed in some Vladimir Nabokov ebooks on a whim, and when I came back from work the damn thing was writing stuff that would be good coming from the real Nabokov. In another project, I elicited spookily good, often hilarious and/or beautiful imitations of a certain notorious blogger (curated selections here). More recently I’ve gotten more ambitious, and have used some encoding tricks together with fine-tuning to interactively simulate myself. Speaking as, well, a sort of expert on what I sound like, I can tell you that – in scientific parlance – the results have been trippy as hell.

Look, I know I’m deviating away from structured academic point-making into fuzzy emotive goopiness, but … I like words, I like reading and writing, and when I look at this thing, I recognize something.

These machines can do scores of different things that, individually, looked like fundamental challenges in 2001. They don’t always do them “the right way,” by the canons of psycholinguistics; in edge cases they might zig where a human child would zag. But they do things the right way by the canon of me, according to the linguistic competence of a human adult with properly functioning language circuits in his cortex.

What does it mean for psycholinguistics, that a machine exists which can write but not wug, which can run but not walk? It means a whole lot. It means it is possible to run without being able to walk. If the canons of psycholinguistics say this is impossible, so much the worse for them, and so much the better for our understanding of the human brain.

(v.)

Does the distinctive, oddly simple structure of the transformer bear some functional similarity to the circuit design of, I don’t know, Broca’s area? I have tried, with my great ignorance of actual neurobiology, to look into this question, and I have not had much success.

But if there’s anyone out there less ignorant than me who agrees with the Gary Marcus of 2001, this question should be burning in their mind. PhDs should be done on this. Careers should be made from the question: what do the latest neural nets teach us, not about “AI,” but about the human brain? We are sitting on a trove of psycholinguistic evidence so wonderful and distinctive, we didn’t even imagine it as a possibility, back in the early 2000s.

This is wonderful! This is the food that will feed revolutions in your field! What are you doing with it?

(vi.)

The answer to that question is the real reason this essay exists, and the reason it takes such an oddly irritable tone.

Here is Gary Marcus in 2001:

When I was searching for graduate programs, I attended a brilliant lecture by Steven Pinker in which he compared PDP [i.e. connectionist -nostalgebraist] and symbol-manipulation accounts of the inflection of the English past tense. The lecture convinced me that I needed to work with Pinker at MIT. Soon after I arrived, Pinker and I began collaborating on a study of children’s over-regularization errors (breaked, eated, and the like). Infected by Pinker’s enthusiasm, the minutiae of English irregular verbs came to pervade my every thought.

Among other things, the results we found argued against a particular kind of neural network model. As I began giving lectures on our results, I discovered a communication problem. No matter what I said, people would take me as arguing against all forms of connectionism. No matter how much I stressed the fact that other, more sophisticated kinds of network models [! -nostalgebraist] were left untouched by our research, people always seem to come away thinking, “Marcus is an anti-connectionist.”

But I am not an anti-connectionist; I am opposed only to a particular subset of the possible connectionist models. The problem is that the term connectionism has become synonymous with a single kind of network model, a kind of empiricist model with very little innate structure, a type of model that uses a learning algorithm known as back-propagation. These are not the only kinds of connectionist models that could be built; indeed, they are not even the only kinds of connectionist models that are being built, but because they are so radical, they continue to attract most of the attention.

A major goal of this book is to convince you, the reader, that the type of network that gets so much attention occupies just a small corner in a vast space of possible network models. I suggest that adequate models of cognition most likely lie in a different, less explored part of the space of possible models. Whether or not you agree with my specific proposals, I hope that you will at least see the value of exploring a broader range of possible models. Connectionism need not just be about backpropagation and empiricism. Taken more broadly, it could well help us answer the twin questions of what the mind’s basic building blocks are and how those building blocks can be implemented in the brain.

What is Gary Marcus doing in 2019? He has become a polemicist against “deep learning.” He has engaged in long-running wars of words, on Facebook and twitter and the debate circuit, with a number of “deep learning” pioneers, most notably Yann LeCun – the inventor of the CNN, one of the first big breakthroughs in adding innate structure to move beyond the generalization limits of the bad “connectionist”-style models.

Here is Gary Marcus in September 2019, taking aim at GPT-2 specifically, after citing a specific continuation-from-prompt that flouted common sense:

Current AI systems are largely powered by a statistical technique called deep learning, and deep learning is very effective at learning correlations, such as correlations between images or sounds and labels. But deep learning struggles when it comes to understanding how objects like sentences relate to their parts (like words and phrases).

Why? It’s missing what linguists call compositionality: a way of constructing the meaning of a complex sentence from the meaning of its parts. For example, in the sentence “The moon is 240,000 miles from the Earth,” the word moon means one specific astronomical object, Earth means another, mile means a unit of distance, 240,000 means a number, and then, by virtue of the way that phrases and sentences work compositionally in English, 240,000 miles means a particular length, and the sentence “The moon is 240,000 miles from the Earth” asserts that the distance between the two heavenly bodies is that particular length.

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure. It can learn that dogs have tails and legs, but it doesn’t know how they relate to the life cycle of a dog. Deep learning doesn’t recognize a dog as an animal composed of parts like a head, a tail, and four legs, or even what an animal is, let alone what a head is, and how the concept of head varies across frogs, dogs, and people, different in details yet bearing a common relation to bodies. Nor does deep learning recognize that a sentence like “The moon is 240,000 miles from the Earth” contains phrases that refer to two heavenly bodies and a length.

“Surprisingly, deep learning doesn’t really have any direct way of handling compositionality.” But the whole point of The Algebraic Mind was that it doesn’t matter whether something implements a symbol-manipulating process transparently or opaquely, directly or indirectly – it just matters whether or not it implements it, full stop.

GPT-2 can fucking write. (BTW, since we’ve touched on the topic of linguistic nuance, I claim the expletive is crucial to my meaning: it’s one thing to merely put some rule-compliant words down on a page and another to fucking write, if you get my drift, and GPT-2 does both.)

This should count as a large quantity of evidence in favor of the claim that, whatever necessary conditions there are for the ability to fucking write, they are in fact satisfied by GPT-2’s architecture. If compositionality is necessary, then this sort of “deep learning” implements compositionality, even if this fact is not superficially obvious from its structure. (The last clause should go without saying to a reader of The Algebraic Mind, but apparently needs explicit spelling out in 2019.)

On the other hand, if “deep learning” cannot do compositionality, then compositionality is not necessary to fucking write. Now, perhaps that just means you can run without walking. Perhaps GPT-2 is a bizarre blind alley passing through an extremely virtuosic kind of simulated competence that will, despite appearances, never quite lead into real competence.

But even this would be an important discovery – the discovery that huge swaths of what we consider most essential about language can be done “non-linguistically.” For every easy test that children pass and GPT-2 fails, there are hard tests GPT-2 passes which the scholars of 2001 would have thought far beyond the reach of any near-future machine. If this is the conclusion we’re drawing, it would imply a kind of paranoia about true linguistic ability, an insistence that one can do so much of it so well, can learn to write spookily like Nabokov (or like me) given 12 books and 6 hours to chew on them … and yet still not be “the real thing,” not even a little bit. It would imply that there are language-like behaviors out there in logical space which aren’t language and which are nonetheless so much like it, non-trivially, beautifully, spine-chillingly like it.

There is no reading of the situation I can contrive in which we do not learn at least one very important thing about language and the mind.

(vii.)

Who cares about “language and the mind” anymore, in 2019?

I did, as a teenager in the 2000s. Gary Marcus and Steven Pinker did, back then. And I still do, even though – in a characteristically 2019 turn-of-the-tables – I am supposed to be something like an “AI researcher,” and not a psychologist or linguist.

What are the scholars of language and the mind talking about these days? They are talking about AI. They are saying GPT-2 isn’t the “right path” to AI, because it has so many gaps, because it doesn’t look like what they imagined the nice, step-by-step, symbol-manipulating, human-childhood-imitating path to AI would look like.

GPT-2 doesn’t know anything. It doesn’t know that words have referents. It has no common sense, no intuitive physics or psychology or causal modeling, apart from the simulations of these things cheap enough to build inside of a word-prediction engine that has never seen or heard a dog, only the letters d-o-g (and c-a-n-i-n-e, and R-o-t-t-w-e-i-l-e-r, and so forth).

And yet it can fucking write.

The scholars of language and the mind say: “this isn’t ‘the path to AI’. Why, it doesn’t know anything! It runs before it can walk. It reads without talking, speaks without hearing, opines about Obama without ever having gurgled at the mobile posed over its crib. Don’t trust the hype machine. This isn’t ‘intelligence.’”

And I, an “AI researcher,” say: “look, I don’t care about AI. The thing can fucking write and yet it doesn’t know anything! We have a model for like 100 different complex linguistic behaviors, at once, integrated correctly and with gusto, and apparently you can do all that without actually knowing anything or having a world-model, as long as you have this one special kind of computational architecture. Like, holy shit! Stop the presses at MIT Press! We have just learned something incredibly cool about language and the mind, and someone should study it!”

And the scholars of language and the mind go off and debate Yann LeCun and Yoshua Bengio on the topic of whether “deep learning” is enough without incorporating components that look explicitly “symbolic.” Back in 2001, Marcus (correctly) argued that the bad, primitive connectionist architectures of the time often did manipulate symbols, sometimes without their creators realizing it. Now the successors of the “connectionist” models, having experimented with innate structure just like Marcus said they should, can do things no one in 2001 even dreamed of … and somehow, absurdly, we’ve forgotten the insight that a model can be symbolic without looking symbolic. We’ve gone from attributing symbol-manipulation powers to vanilla empiricist models that sucked, to denying those powers to much more nativist models that can fucking write.

What happened? Where did the psycholinguists go, and how can I get them back?

Here is Steven Pinker in 2019, explaining why he is unimpressed with GPT-2’s “superficially plausible gobbledygook”:

Being amnesic for how it began a phrase or sentence, it won’t consistently complete it with the necessary agreement and concord – to say nothing of semantic coherence. And this reveals the second problem: real language does not consist of a running monologue that sounds sort of like English. It’s a way of expressing ideas, a mapping from meaning to sound or text. To put it crudely, speaking or writing is a box whose input is a meaning plus a communicative intent, and whose output is a string of words; comprehension is a box with the opposite information flow.

“Real language does not consist of a running monologue that sounds sort of like English.” Excuse me? Does the English past tense not matter anymore? Is morphosyntax nothing? Style, tone, nuances of diction, tics of punctuation? Have you just given up on studying language qua language the way Chomsky did, just conceded that whole thing to the evil “deep learning” people without saying so?

Aren’t you a scientist? Aren’t you curious? Isn’t this fascinating?

Hello? Hello? Is there anyone in here who can produce novel thoughts and not just garbled regurgitations of outdated academic discourse? Or should I just go back to talking to GPT-2?