The Diligent Turing Test

Researchers at UC San Diego recently published the paper “Large Language Models Pass the Turing Test.” As they describe it, the central question of the Turing Test is whether AI is “distinguishable from humans.” Ancient epistemological debates focused heavily on the idea of indistinguishable “impressions” (sensations), and their conclusions may prove useful in improving the Turing Test, which hinges on distinguishability. Of particular interest is the philosopher Antiochus of Ascalon, whose own writings are entirely lost but whose reasoning is preserved in sources such as On Academic Scepticism by Cicero. In this post, I will rely on reasoning put forth by Antiochus to argue that researchers conducting a Turing Test should not seek random pools of research participants (such as UCSD undergrads) but should instead recruit human judges who are deeply familiar with AI, even going so far as to encourage them to research the subject before the trial or to incorporate feedback to become better at their task.
As pointed out here, the Turing Test could plausibly be lackluster if AI were intelligent but not human-like. Whether the Turing Test is the right benchmark at all deserves its own discussion, but this post focuses on the idea of indistinguishability and its implications for conducting real-life Turing Tests.
On Academic Scepticism describes a debate between Lucullus, a student of Antiochus, and Cicero, the Skeptic author. Lucullus argues for the possibility of knowledge on the premise that a true impression can always be distinguished from a false one, and Cicero contests this point. Lucullus brings up the example of eggs: although eggs seem indistinguishable from one another, someone with enough knowledge of and familiarity with the subject, in this case a chicken farmer, is able to distinguish between them. Cicero responds,
Everything has its own kind, nothing is identical with something else, you say. It’s certainly the Stoic view, and not a particularly credible one, that no strand of hair in the world is just like another, nor any grain of sand. I could refute this view, but I have no desire to put up a fight. It doesn’t matter, for our purposes, whether the objects of our impressions don’t differ at all or can’t be discriminated, even if they do differ. Still, if there can’t be such similarity between people, what about between statues? Are you saying that Lysippus couldn’t have made a hundred Alexanders just like one another, if he used the same bronze, the same process, the same tool, etc.? Tell me what marking you would have used to differentiate them! How about if I stamp a hundred seals into wax of the same type with this ring? Are you really going to be able to find a means of distinguishing them? Or will you need to find a ring-maker like that Delian chicken-farmer you found who could recognize eggs? But you appeal to technical skill even in support of the senses. A painter can see details we can’t; an expert recognizes the song at the first notes of the flute. So what? Doesn’t this tell against you, if we can’t see or hear without complex skills to which few can aspire (at least in this country)?
(Cicero, On Academic Scepticism, secs. 85–87).
Cicero here does concede that experts can often distinguish between things that laymen cannot. What he could not foresee is that DNA testing would one day let humans distinguish individual hairs, and that X-ray analysis would let them compare the chemical composition of grains of sand. This pattern, that distinguishability increases with knowledge and technology, should be applied to our evaluation of AI.
Imagine for a moment that a child were given the role of judge in the Turing Test. They ask the AI a question, and in its response it references “sneetches.” The child, familiar with Dr. Seuss but unaware that the average human would be unlikely to use this word, incorrectly guesses that the AI is human. From this thought experiment it is clear that adult judges would be more successful than children.
Now imagine that tomorrow, every government on Earth outlawed any changes to current LLMs but allowed the public to keep using them, so long as the models remain unchanged. (Suppose too that everyone complied.) The UCSD study found “win-rates” for LLMs ranging between 21% and 73%. My hypothesis is that if LLMs were frozen in time and this study were repeated year after year, the win-rate of each model would decline steadily as the public became more and more familiar with the tendencies of AI. Even granting that AI will improve continuously, I believe that for a machine to pass the Turing Test, it must be immune to this scenario.
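To make this hypothesis concrete, here is a toy simulation in Python. It is a sketch only: the starting win-rate is taken from the upper end of the UCSD range, but the exponential decay model and the learning rate are assumptions of mine, not measured quantities.

```python
import random

# Toy simulation of the "frozen model" hypothesis. The decay model and
# LEARNING_RATE are assumptions for illustration, not empirical values.
random.seed(0)

N_TRIALS = 1000        # conversations per yearly study
BASE_WIN_RATE = 0.73   # assumed starting point: upper end of the UCSD range
LEARNING_RATE = 0.25   # assumed yearly growth in public familiarity

for year in range(6):
    # Judges detect the AI more often as familiarity accumulates,
    # so the probability of the AI fooling a judge decays each year.
    p_fooled = BASE_WIN_RATE * (1 - LEARNING_RATE) ** year
    wins = sum(random.random() < p_fooled for _ in range(N_TRIALS))
    print(f"Year {year}: win-rate = {wins / N_TRIALS:.2%}")
```

Under these assumptions the win-rate falls below 25% within five years. The exact numbers are not the point; the point is that any fixed model loses ground against judges who keep learning.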
Finally, suppose that the use of LLMs were outlawed for everyone except qualified researchers, and that two groups of judges then performed the Turing Test on some machine: researchers and non-researchers. Assume the machine's win-rate is high against the non-researchers and low against the researchers. Did the machine pass the Turing Test? I argue no. We see the same pattern as in the last two examples: distinguishability depends heavily on the expertise of the judges.
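For readers who want to quantify the gap between the two pools, a standard two-proportion z-test would do. Below is a minimal sketch in Python; the win counts are invented for illustration and do not come from any real study.

```python
from math import erf, sqrt

# Hypothetical results: two judge pools evaluating the same frozen model.
wins_lay, n_lay = 65, 100          # non-researchers fooled 65% of the time
wins_expert, n_expert = 22, 100    # researchers fooled only 22% of the time

p1, p2 = wins_lay / n_lay, wins_expert / n_expert
p_pool = (wins_lay + wins_expert) / (n_lay + n_expert)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_lay + 1 / n_expert))
z = (p1 - p2) / se

# Two-sided p-value from the normal approximation.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A tiny p-value here would confirm that the researcher pool is genuinely harder to fool, which is exactly the asymmetry the thought experiment stipulates.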
Following these thought experiments, it is evident why Turing Tests should employ experts. Testing whether the average person (or a smart but non-knowledgeable person) can distinguish between AI and human is a different question from testing whether the two can be distinguished at all. The latter, to my mind, is the more interesting question, and it is best approximated by employing experts, not laymen. This argument closely aligns with Lucullus's: it doesn't matter much whether the average person can tell eggs apart; we should be testing the chicken farmers.