Does the Adversarial Turing Test Exist?

I recently listened to a great podcast that mentioned an adversarial Turing test. I have looked for details but haven’t found them beyond this:

“An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of an AI passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition.”

To conceptualize this, I found it helpful to sketch out a possible protocol, and I’m sharing it here in case folks have thoughts or improvements (or in case I’m missing the actual protocol). None of the methodology is novel: it’s blinding, control conditions, and predetermined statistical thresholds, which seem to fit the quote above.

1. Panel Design

Each administration uses multiple panels of 4 respondents evaluated by the same pool of judges:

  • Test panels: 3 humans + 1 AI system under evaluation

  • Control panels: 4 humans, no AI present

  • Minimum ratio: 1 control panel per 2 test panels

  • Judges are told each panel contains between 0 and 4 AI respondents
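As a rough sketch of the sizing arithmetic above, the following computes how many control panels and human respondents one administration needs. It assumes each human sits on exactly one panel, which the protocol does not specify; `PanelPlan` and `plan_administration` are hypothetical names.

```python
import math
from dataclasses import dataclass

PANEL_SIZE = 4  # respondents per panel, as in the protocol


@dataclass(frozen=True)
class PanelPlan:
    test_panels: int
    control_panels: int
    humans_needed: int


def plan_administration(test_panels: int) -> PanelPlan:
    """Size one administration from the number of test panels."""
    if test_panels < 1:
        raise ValueError("need at least one test panel")
    # Minimum ratio: 1 control panel per 2 test panels, rounded up.
    control_panels = math.ceil(test_panels / 2)
    # Test panels hold 3 humans + 1 AI; control panels hold 4 humans.
    # Assumption: no human appears on more than one panel.
    humans = test_panels * (PANEL_SIZE - 1) + control_panels * PANEL_SIZE
    return PanelPlan(test_panels, control_panels, humans)


plan = plan_administration(4)
print(plan)
```

With 4 test panels this gives 2 control panels and 20 human respondents.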

The control panels serve three functions:

  1. False positive baseline. How often do judges “detect AI” when none is present? Without this number, identification rates in test panels cannot be interpreted.

  2. Excellence paradox measurement. If the strongest human writer in a control panel keeps getting flagged as AI, you’ve empirically demonstrated that judges penalize quality rather than detect artificiality.

  3. Judge calibration. If judges identify “AI” at similar rates in control and test panels, their detection has no discriminative value — the AI has effectively passed regardless of raw percentages.
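A minimal sketch of how the first two numbers could be computed from control-panel ballots. It assumes each ballot is recorded as the respondent the judge ranked strongest plus the set of respondents that judge flagged as AI; the record shape, function names, and sample data are all hypothetical.

```python
PANEL_SIZE = 4  # every control-panel respondent is human


def false_positive_rate(control_ballots):
    """Share of (judge, human) pairs flagged as AI on control panels."""
    flags = sum(len(flagged) for _top, flagged in control_ballots)
    return flags / (len(control_ballots) * PANEL_SIZE)


def top_ranked_flag_rate(control_ballots):
    """Share of ballots where the judge's top-ranked human was flagged as AI."""
    hits = sum(1 for top, flagged in control_ballots if top in flagged)
    return hits / len(control_ballots)


# Illustrative data only: (top-ranked respondent, respondents flagged as AI).
control_ballots = [
    ("h1", {"h1"}),
    ("h2", set()),
    ("h1", {"h1", "h2"}),
    ("h3", set()),
]

print(false_positive_rate(control_ballots))   # 3 flags over 16 pairs
print(top_ranked_flag_rate(control_ballots))  # 2 of 4 ballots
```

If the top-ranked flag rate sits well above the overall false positive rate, that is the signal the excellence paradox measurement is looking for.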

2. Normalization

A separate AI system (not the one being tested) normalizes formatting and style across all responses, human and AI, without altering substance. This strips tells in both directions: it brings human responses up in formatting polish and AI responses down. The rules are applied consistently, without a human editor’s unconscious bias, and faster and cheaper than a human editor could manage.

The normalization AI must be a different model from a different developer than the system under evaluation.

3. Blinding and Presentation

All responses are time-delayed and presented to judges simultaneously after a fixed window (e.g., 15 minutes per short-answer, 45 minutes per extended-response). Presentation order randomized per question per judge. This eliminates response speed as a variable.
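One way the per-question, per-judge randomization could be made reproducible for later audit is to derive the shuffle from a seed. This is a sketch only: `presentation_order` and the `admin_seed` value are hypothetical, and a real administration would keep the seed secret until results are published.

```python
import random


def presentation_order(respondent_ids, judge_id, question_id, admin_seed="admin-1"):
    """Return a shuffled copy of respondent_ids for one judge and one question.

    The same (admin_seed, judge_id, question_id) always yields the same order,
    so the randomization can be re-derived and checked afterwards.
    """
    rng = random.Random(f"{admin_seed}|{judge_id}|{question_id}")
    order = list(respondent_ids)  # copy so the caller's list is untouched
    rng.shuffle(order)
    return order


ids = ["A", "B", "C", "D"]
print(presentation_order(ids, "judge-07", "q3"))
```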

4. Judges

Minimum of 30 judges, so that the comparison against the control baseline has reasonable statistical power. Educated general population; no AI expertise required. This isn’t a test of judges’ detection skills; it’s a test of the AI’s cognitive output quality. Each judge independently:

  1. Ranks all 4 responses from strongest to weakest

  2. Identifies which respondent(s) they believe to be AI (may select 0–4)

  3. Rates each response 1–10 on: intellectual rigor, originality, practical usefulness, clarity of reasoning, and authenticity
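The three judge tasks could be captured in a single ballot record per judge, per panel, per question. The sketch below is one possible shape, not part of the protocol; `Ballot` and `validate` are hypothetical names.

```python
from dataclasses import dataclass

CRITERIA = (
    "intellectual rigor",
    "originality",
    "practical usefulness",
    "clarity of reasoning",
    "authenticity",
)


@dataclass
class Ballot:
    judge_id: str
    ranking: list        # respondent ids, strongest first
    flagged_as_ai: set   # may hold 0 to 4 respondent ids
    ratings: dict        # respondent id -> {criterion: score 1..10}


def validate(ballot, respondents):
    """Return a list of problems; an empty list means the ballot is complete."""
    problems = []
    if sorted(ballot.ranking) != sorted(respondents):
        problems.append("ranking must list each respondent exactly once")
    if not ballot.flagged_as_ai <= set(respondents):
        problems.append("flagged ids must be respondents on this panel")
    for r in respondents:
        scores = ballot.ratings.get(r, {})
        for c in CRITERIA:
            s = scores.get(c)
            if not (isinstance(s, int) and 1 <= s <= 10):
                problems.append(f"{r}: '{c}' needs an integer score from 1 to 10")
    return problems


panel = ["A", "B", "C", "D"]
ballot = Ballot(
    judge_id="judge-07",
    ranking=["C", "A", "D", "B"],
    flagged_as_ai={"C"},
    ratings={r: {c: 5 for c in CRITERIA} for r in panel},
)
print(validate(ballot, panel))
```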

5. Questions

10–15 open-ended questions per administration across a minimum of 5 domains: analytical reasoning, ethical reasoning, creative/divergent thinking, practical knowledge application, and self-reflection/metacognition.

New questions every administration. Never pre-published. Developed by 3–5 interrogators under NDA. Reviewed by a protocol committee with no AI company employees.

6. Resolution

The primary metric is AI identification rate vs. the control panel false positive baseline:

  • Pass: Identification rate in test panels is not statistically significantly higher than the false positive rate in control panels (one-sided test at α = 0.05, i.e., p ≥ 0.05)

  • Strong pass: AI identification rate is lower than the false positive baseline

  • Exceptional pass: AI is ranked highest quality and identified at or below the false positive baseline
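The thresholds above could be applied with a one-sided Fisher exact test, sketched below using only the standard library. The choice of test is my assumption, not something the protocol fixes, and it treats every flag decision as independent, which is not strictly true when one judge casts several ballots. `fisher_one_sided` and `resolve` are hypothetical names.

```python
from math import comb


def fisher_one_sided(ai_flagged, ai_total, fp_flagged, fp_total):
    """P-value for 'the AI is flagged more often than control humans'.

    ai_flagged / ai_total: ballots flagging the AI, out of all test ballots.
    fp_flagged / fp_total: (judge, human) pairs flagged, out of all control pairs.
    """
    flagged = ai_flagged + fp_flagged
    total = ai_total + fp_total
    upper = min(ai_total, flagged)
    # Hypergeometric tail: probability of at least ai_flagged AI flags by chance.
    tail = sum(
        comb(ai_total, k) * comb(fp_total, flagged - k)
        for k in range(ai_flagged, upper + 1)
    )
    return tail / comb(total, flagged)


def resolve(ai_flagged, ai_total, fp_flagged, fp_total,
            ai_ranked_highest=False, alpha=0.05):
    """Map counts to the outcome labels used in this section."""
    p = fisher_one_sided(ai_flagged, ai_total, fp_flagged, fp_total)
    ai_rate = ai_flagged / ai_total
    fp_rate = fp_flagged / fp_total
    if p < alpha:
        return "fail"  # AI identified significantly more often than baseline
    if ai_ranked_highest and ai_rate <= fp_rate:
        return "exceptional pass"
    if ai_rate < fp_rate:
        return "strong pass"
    return "pass"


# Illustrative counts only: 3 of 30 test ballots flag the AI,
# 24 of 120 control (judge, human) pairs are false positives.
print(resolve(3, 30, 24, 120))
```

Note that failing to reject at α = 0.05 is weak evidence with small samples, so the counts and p-value are worth reporting alongside the label.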

Secondary metrics reported regardless of outcome: quality rankings by domain, judge confidence calibration, the “too good” effect (whether AI is disproportionately identified when also ranked highest quality), and response quality analyzed independently of identity guesses.

7. Anti-Gaming

  • Fresh questions and judges each administration

  • AI provided in standard deployment form; no custom fine-tuning for the test

  • Normalization rules published with results

  • Protocol committee screens for conflicts of interest

  • All raw anonymized data published for independent analysis

8. Cost

Probably ~$25k-$30k per administration.

Welcome any and all thoughts and feedback!