[QUESTION]: Academic social science and machine learning

I asked this question on Facebook here, and got some interesting answers, but I thought it would be interesting to ask LessWrong and get a larger range of opinions. I've modified the list of options somewhat.

What explains why some classification, prediction, and regression methods are common in academic social science, while others are common in machine learning and data science?

For instance, I've encountered probit models in some academic social science, but not in machine learning.

Similarly, I've encountered support vector machines, artificial neural networks, and random forests in machine learning, but not in academic social science.

The main algorithms that I believe are common to academic social science and machine learning are the most standard regression algorithms: linear regression and logistic regression.

Possibilities that come to mind:

(0) My observation is wrong and/or the whole question is misguided.

(1) The focus in machine learning is on algorithms that can perform well on large data sets. Thus, for instance, probit models may be academically useful but don't scale up as well as logistic regression.
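A minimal sketch of the split in tooling mentioned in (1), assuming statsmodels and scikit-learn are installed and using synthetic data: probit and logit models can both be fit in statsmodels (a social-science-oriented package), whereas scikit-learn ships a LogisticRegression estimator but no probit classifier.

```python
# Sketch: probit vs. logit fits on the same synthetic binary-outcome data.
# statsmodels offers both; scikit-learn's core estimators include only logistic regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))          # intercept + 2 features
latent = X @ np.array([0.5, 1.0, -0.8])
y = (latent + rng.normal(size=n) > 0).astype(int)      # binary outcome

probit_fit = sm.Probit(y, X).fit(disp=0)               # normal-CDF link
logit_fit = sm.Logit(y, X).fit(disp=0)                 # logistic link

print(probit_fit.params)                               # coefficients on the probit scale
print(logit_fit.params)                                # coefficients on the logit scale
```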

(2) Academic social scientists take time to catch up with new machine learning approaches. Of the methods mentioned above, random forests and support vector machines were introduced as recently as 1995. Neural networks are older, but their practical implementation is about as recent. Moreover, the practical implementations of these algorithms in the standard statistical software and packages that academics rely on are even more recent. (This relates to point (4).)

(3) Academic social scientists are focused on publishing papers, where the goal is generally to determine whether a hypothesis is true. Therefore, they rely on approaches that have clear rules for hypothesis testing and for establishing statistical significance (see also this post of mine). Many of the newer machine learning approaches don't have clearly defined procedures for significance testing. Also, the strength of machine learning approaches lies more in exploration than in testing already-formulated hypotheses (this relates to point (5)).
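To illustrate the contrast in (3), here is a minimal sketch (library choices are my assumption: statsmodels and scikit-learn, on synthetic data) of how a classical regression comes with per-coefficient p-values ready for a paper, while a random forest reports feature importances with no built-in significance test.

```python
# Sketch: built-in significance testing (OLS) vs. none out of the box (random forest).
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(size=500)               # only the first feature matters

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.pvalues)                                      # per-coefficient p-values

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)                          # relative importances, no p-values
```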

(4) Some of the newer methods are complicated to code, and academic social scientists don't know enough mathematics, computer science, or statistics to use them comfortably (this may change if the methods are taught more in graduate school, but their relative newness is a factor here, relating to (2)).

(5) It's hard to interpret the results of fancy machine learning tools in a manner that yields social-scientific insight. The results of a linear or logistic regression can be interpreted somewhat intuitively: the parameters (coefficients) associated with individual features describe the extent to which those features affect the output variable. Modulo issues of feature scaling, larger coefficients mean those features play a bigger role in determining the output. Pairwise and listwise R^2 values provide additional insight into how much signal and noise there is in individual features. But if you're looking at a neural network, it's quite hard to infer human-understandable rules from it. (The opposite direction is not too hard: it is possible to convert human-understandable rules into a decision tree, then use a neural network to approximate that and add appropriate fuzziness. But the neural networks we obtain as a result of machine learning optimization may be quite different from those that we can interpret as humans.)

To my knowledge, there haven't been attempts to reinterpret neural network results in human-understandable terms, though Sebastian Kwiatkowski's comment on my Facebook post points to an example where the results of naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as "small bedroom", were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit). But Kwiatkowski's comment also pointed to other instances where the machine's algorithms weren't human-interpretable.
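A minimal sketch of the interpretability gap described in (5), assuming scikit-learn and a synthetic classification dataset: a logistic regression exposes one coefficient per (scaled) feature, while a neural network spreads its "explanation" across layers of weight matrices with no per-feature story.

```python
# Sketch: per-feature coefficients (logistic regression) vs. layered weights (MLP).
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)                   # common scale, so coefficients are comparable

logreg = LogisticRegression().fit(X, y)
print(logreg.coef_)                                     # one coefficient per feature: sign and strength

mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0).fit(X, y)
print([w.shape for w in mlp.coefs_])                    # weight matrices per layer, hard to read off rules
```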

What's your personal view on my main question, and on any related issues?