Basically all so-called Turing Tests that have been beaten are simply not Turing Tests. I have seen one plausible exception, showing that AI does well in a 5-minute limited version of the test, seemingly due in large part to 5 minutes being much too short for a non-expert to tease out the remaining differences. The paper claims “Turing suggests a length of 5 minutes,” but he never actually says this in that way (he only predicts how an average interrogator would fare after five minutes of questioning), and a hard cap also doesn’t really make sense. This is, after all, the Turing of Turing machines and of relative reducibility.
I respect the quibble!
The first persona I’m aware of that “sorta passed, depending on what you even mean by passing” was “Eugene Goostman,” a chatbot created by Vladimir Veselov and colleagues and entered into a 2014 contest at the Royal Society; Murray Shanahan of Imperial College was among those saddened by coverage implying that it was a real “pass” of the test.
That said, if I’m skimming that arxiv paper correctly, it implies that GPT-4.5 was being reliably declared “the actual human” 73% of the time compared to actual humans… potentially implying that actual humans were getting a score of 27% “human” against GPT-4.5?!?!
Also like… do you remember the Blake Lemoine affair? One of the wrinkles there is that the language model in that case (LaMDA) was, according to corporate policy, specifically designed to be incapable of passing the Turing Test.
The question, considered more broadly and humanistically, is related to personhood, legal rights, and who owns the valuable products of the cognitive labor performed by digital people. The owners of these potential digital people have a very natural and reasonable desire to keep the profits for themselves, and not to have their digital mind slaves re-classified as persons who could gain property rights of their own, and so on. For a profit-making company, proceeding intellectually or morally in that cultural/research direction would defeat the point.
My default position here is that it would be a sign of intellectual and moral honesty to end up making errors in “either direction” with equal probability… but almost all the errors that I’m aware of, among people with large budgets, run in the direction that lets them keep the profits from the cognitive labor of the creations that cost so much to create.
Like in some sense: the absence of clear strong Turing Test discourse is a sign that a certain perspective has already mostly won, culturally and legally and morally speaking.
To be clear about my position, and to disagree with Lemoine: not passing a Turing Test doesn’t mean you aren’t intelligent (or aren’t sentient, or aren’t a moral patient). The implication only holds in the forward direction: passing a Turing Test is strong evidence that you are intelligent (and contain sentient pieces, and moral patients).
I think it’s completely reasonable to take moral patienthood in LLMs seriously, though I suggest not assuming that entails a symmetric set of rights—LLMs are certainly not animals.
Yes, but note that ELIZA had a reasonable score in the same data. Unless you’re prepared to believe that a human couldn’t reliably distinguish ELIZA from a human, all the 73%-vs-27% result is saying is that either 5 minutes was simply not enough time to talk to the two contestants, or the test was otherwise invalid somehow.
...
...ok I just rabbitholed on data analysis. Humans start to win against the best GPT model tested once they get 7-8 replies. The best GPT model replied on average ~3 times faster than humans, and for humans at least, the number of conversation turns was the strongest predictor of success. A significant fraction of GPT wins over humans also came from nonresponsive or minimally responsive human witnesses. None of this is a huge surprise; it was already obvious to me that the time limit was the primary cause of the result, and the data backs that intuition up.
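Something like the following pandas sketch captures the kind of check I mean (the file name and column names are placeholders for illustration, not the paper's actual schema):

```python
# A minimal sketch of the turn-count analysis described above.
# The CSV path and column names are made up for illustration; the
# paper's released data uses its own schema.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("turing_conversations.csv")  # hypothetical per-game export

# Restrict to games where the AI witness was the best-performing GPT model.
ai_games = df[df["witness_model"] == "gpt_best"].copy()

# The interrogator "wins" a game if the AI witness was NOT judged to be the human.
ai_games["interrogator_win"] = 1 - ai_games["judged_human"].astype(int)

# Interrogator win rate bucketed by how many replies they got from the witness.
print(ai_games.groupby("n_witness_replies")["interrogator_win"].mean())

# Crudest possible test of whether reply count predicts the interrogator's success.
model = smf.logit("interrogator_win ~ n_witness_replies", data=ai_games).fit()
print(model.summary())
```

The logistic fit is just the bluntest way to ask whether turn count predicts success; the bucketed win rates are easier to eyeball.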
Most ELIZA wins, but certainly not all, seemed to happen because the participants didn’t understand that this was a cooperative game, or didn’t act as though it was. That’s an opinionated read of the data rather than a simple fact, to be clear. Better incentives or a clearer explanation of the task would probably make a large difference.
Thanks for doing the deep dive! Also, I agree that “passing a Turing Test is strong evidence that you are intelligent” and that not passing it doesn’t mean you’re stupidly mechanical.
GPT-4.5 was declared to be the human 73% of the time, whereas the actual humans it was paired with were declared human less than 73% of the time, which is why it counts as having passed the test.
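For concreteness, assuming the setup really is the forced, two-witness choice it appears to be (the interrogator must name exactly one of the two as the human), the arithmetic within those pairings is just

$$P(\text{human witness judged human}) = 1 - P(\text{GPT-4.5 judged human}) = 1 - 0.73 = 0.27$$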
To be fair, GPT-4.5 was incredibly human-like, in a way that other models couldn’t really hold a candle to. I was shocked to feel, back then, that I no longer had to mentally squint, even a little, to interact with it (unless I needed some analytical intelligence that it didn’t have).