What’s up with ChatGPT and the Turing Test?

ChatGPT looks like it would pass the Turing Test, the gold standard of benchmarks measuring whether an AI has reached human-level intelligence. Yet Googling around, it doesn’t seem that anyone has put on a full Turing Test.

Please comment with your thoughts on whether or how such a test can be put on.

It doesn’t seem difficult. The Loebner Prize has measured progress towards the Turing Test since 1990. All you need is a human judge and test subject, and ChatGPT.

The Turing Test is not perfect. It is a sufficient but not necessary test of human-level intelligence: An AI that can pass it can cover any area of human intelligence transmissible in text chat, at a level such that human-level intelligence (the judges) cannot tell the difference.

But it has long been recognized that an AI which is generally human-level or beyond could still fail the Turing Test. If it had personality quirks, yet otherwise managed to cover almost all areas of achievement—think of neuroatypicality taken a few steps further—we would call it generally intelligent. If it communicated only in telegraphic staccato yet was vastly more able than humans to earn billions of dollars a day, to create art admired by humans who don’t know who created it, correctly interpret human feeling, we would still consider it intelligent. If it used nothing but nanonengineering to convert the Earth to computer chips within minutes, so it could better achieve its goal of calculating digits of π, we might not want to call it intelligence, but then again, we’d be dead.

Also, because humans are the judges, an AI that can fool the judges with psychological tricks could pass: Even the Eliza of the sixties could do that to some extent.

Still, the Turing Test is a milestone. Ray Kurzweil has long said, repeating it recently, that we can expect an AI to pass in 2029.

One reason we can’t do a Turing Test is that ChatGPT is programmed specifically not to pass: It readily states that it is a language model. This quirk could be bypassed, either by prompt engineering or by manually editing out such claims. But the avoidance of simulation might be too deep for that. ChatGPT is also much faster than human, but that could be handled with an artificial delay.

We could try alternative tests for human-like intelligence:

  • Reverse Turing Test: The human test subject tries to match ChatGPT. Just as a Turing Test determines whether the AI is at human level or above, so too a Reverse Turing Test determines whether the AI is at human level or below. Yet ChatGPT is far beyond humans in its knowledge and poetic abilities (I asked it to compose a limerick abut hiking trail grades and it did much much better than almost any human could.) Then again, computers have long been superhuman in specific areas, like arithmetic.

  • Cyborg Test: We could do a Reverse Turing Test in which ChatGPT competes against a team of humans who are allowed to use specific computing services, e.g. Google. At this point, we are not asking if ChatGPT is at a human level, but rather whether it can “fill in” those areas of accomplishment which other software cannot handle, but a human can. We could also have humans or other software systems (say, a calculator for arithmetic) augment ChatGPT.

  • Another Reverse Turing Test variant, in which ChatGPT determines if the test subject is a human or software. This would be interesting, but useful only to determine how good ChatGPT is at this specific psychological-analysis skill.

  • Evaluation of responses to set prompts, rather than an interactive dialog.

  • Test aimed at humans.

    • An IQ test or SAT would rank ChatGPT on a scale commonly used for humans, and is fairly indicative of abilities in various areas of accomplishment.

    • A Bar Exam or other area-specific exam would not only test ChatGPT’s intelligence and knowledge, but bring this into a more practical area.

    • Assessment tests of general cognitive abilities, beyond the abstract intellectual sphere addressed by most written tests. It seems that some are aimed at children, like the Kaufman Assessment Battery for Children; and some at adults or at a broad age range, like Wechsler Adult Intelligence Scale and the Woodcock-Johnson Tests of Cognitive Abilities. This could be impractical, as some such tests, especially those for children, have non-written components like oral tests or manipulation of objects; and most require direct engagement of a psychologist, which at the very least means that the test is not blinded, as with the Turing Test.