Vegetarianism Ideological Turing Test Results

Back in August I ran a Caplan Test (or, more commonly, an “Ideological Turing Test”) both on Less Wrong and at my local rationality meetup. The topic was diet, specifically: Vegetarian or Omnivore?

If you’re not familiar with Caplan Tests, I suggest reading Palladias’ post on the subject or the Wikipedia article. The test I ran was fairly standard: thirteen blurbs were presented to the judges, each chosen by coin toss to be from either a vegetarian or an omnivore, and also randomly selected to be either genuine or an impostor trying to pass as the other side. My main contribution, which I haven’t seen in previous tests, was asking judges for a credence/probability instead of a simple “I think they’re X.”

I originally chose vegetarianism because I felt it’s an issue that splits our community (and particularly my local community) fairly evenly. A third of the test participants were vegetarians, and according to the 2014 census, only 56% of LWers identify as omnivores.

Before you see the results of the test, please take a moment to say aloud how well you think you could do at predicting whether a given participant was genuine or an impostor.

.

.

.

.

.

.

.

.

.

.

.

.

.

If you think you can do better than chance, you’re probably fooling yourself. If you think you can do significantly better than chance, you’re almost certainly wrong. Here are some statistics to back that claim up.

I got 53 people to judge the test: 43 from Less Wrong and 10 from my local group. Averaging across the entire group, 51.1% of judgments were correct. If my chi-squared math is correct, the p-value for this data under the null hypothesis (judges guessing at chance) is 0.57. (Note that this includes people who judged an entry as 50%. If we exclude those judgments, the success rate drops to 49.4%.)
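For the curious, here’s roughly what that check looks like. This is a minimal sketch in Python, assuming every judge rated all thirteen entries (the 53 × 13 judgment count is my assumption; the real data may have gaps):

```python
# Goodness-of-fit test: is the judges' hit rate distinguishable
# from coin-flipping? Assumes all 53 judges rated all 13 entries.
from scipy.stats import chisquare

n_judgments = 53 * 13                   # 689 judgments in total
n_correct = round(0.511 * n_judgments)  # 51.1% correct -> ~352

chi2, p = chisquare(
    f_obs=[n_correct, n_judgments - n_correct],  # observed: correct vs. incorrect
    f_exp=[n_judgments / 2, n_judgments / 2],    # expected under chance: 50/50
)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")  # chi2 ≈ 0.33, p ≈ 0.57
```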

In retrospect, this seemed rather obvious to me. Vegetarians aren’t significantly different from omnivores; unlike a religion or a political party, diet has few cultural centerpieces. Vegetarian judges did no better than omnivore judges, even when judging vegetarian entries. In other words, in this instance the minority doesn’t possess any special power for detecting other members of the in-group. This test shows a null result: whatever distinguishes vegetarians from omnivores, it isn’t familiarity with the other side’s arguments or culture, at least not to a degree that we can distinguish at a glance.

More interesting, in my opinion, than the null result were the results on the judges’ calibration. Back when I asked you to say aloud how well you’d do, what did you say? Did the last three paragraphs seem obvious? Would it surprise you to learn that not a single one of the 53 judges kept their guesses within a confidence band of 40%–60%? In other words, every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.

(The flip side to this is, of course, that every entrant to the test won! Congratulations, rationalists: signs point to you being able to pass as vegetarians/omnivores when you try, even if you’re not in that category. The average credibility of an impostor entry was 59%, while the average credibility of a genuine response was 55%. No impostor got an average credibility below 49%.)

Using the logarithmic scoring rule from the calibration game, we can measure the community’s error. The average judge got a score of −543. For comparison, a judge who answered 50% (“I don’t know”) on every question would have scored 0. Only eight judges got a positive score, and only one had a score higher than 100 (consistent with random chance). This is actually one area where Less Wrong should feel good. We’re not at all calibrated… but on this test, at least, the judges from the website were much better calibrated than my local community (who mostly just lurk). Separating the two groups, the average score for my local community was −949, while LW averaged −448. Given that I restricted the choices to multiples of 10, a random selection of credences gives an expected average score of −921.
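To make the rule concrete, here’s a sketch of its standard form, scaled so that a 50% answer scores exactly zero (this is consistent with the numbers above, though the exact constants in my spreadsheet may differ):

```python
import math

def log_score(credence: float, genuine: bool) -> float:
    """Per-entry logarithmic score: 100 * log2(2p), where p is the
    probability the judge assigned to the actual outcome. A 50%
    answer scores exactly 0; confident mistakes cost far more than
    confident hits earn."""
    p = credence if genuine else 1.0 - credence
    return 100 * math.log2(2 * p)

# A judge who says "70% genuine" gains ~49 points if the entry is
# genuine but loses ~74 if it's an impostor. A credence of 0 or 1,
# when wrong, scores negative infinity: never be certain.
print(log_score(0.7, genuine=True))   # ~ +48.5
print(log_score(0.7, genuine=False))  # ~ -73.7
```

The asymmetry is the point: across thirteen entries, a handful of confident misses is enough to drag a total score deep into the negatives, which is how averages like −543 happen.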

In short, the LW community proved no better at discerning fact from fiction, but it was significantly less overconfident. More de-biasing needs to be done, however! The next time you come up with a probability to reflect your credence, ask yourself: “Is this the sort of thing that anyone would know? Is this the sort of thing I would know?” The answer will probably be “no” a lot more often than it feels like from the inside.

Full data (minus contact info) can be found here.

Those of you who submitted a piece of writing that I used, or who judged the test and left contact information: I will be sending out personal scores very soon (probably by this weekend). Deep apologies for the delay on this post; I took a vacation in late August and it threw off my attention to this project.

EDIT: Here’s a histogram of the identification accuracy.

[Histogram of identification accuracy]

EDIT 2: For reference, here are the entries that were judged.