On the Loebner Silver Prize (a Turing test)

The Metaculus question “When will the first weakly general AI system be devised, tested, and publicly announced?” has four resolution criteria, and in my opinion, reliably passing a Loebner-Silver-prize-equivalent Turing test is the hardest, since it is the only one that is adversarial. (See Wikipedia for background on the Loebner Prize.)

What’s involved in winning the Loebner Silver Prize?

Though the Loebner Prize is no longer being run, we can get information about its format from the Wayback Machine.
https://web.archive.org/web/20190220020219/https://aisb.org.uk/events/loebner-prize

The link covers up to the 2018 prize. According to Wikipedia, the competition was held once more in 2019, with a dramatically different format (notably not involving judges), and then discontinued. I think the resolution criterion is intended to refer to a more traditional format with judges.

There are two parts to the contest: the selection process and the finals. The selection process is only relevant for deciding which bots to include in the finals, but it's nonetheless interesting to read its transcript: https://web.archive.org/web/20181022113601/https://www.aisb.org.uk/media/files/LoebnerPrize2018/Transcripts_2018.pdf. Mitsuku, the bot that would go on to win the bronze medal, would be considered very bad by today's standards. The selection process consists of 20 pre-decided questions with no follow-up, and my opinion is that an appropriately fine-tuned GPT-4 would likely be indistinguishable from human on these questions.

However, the finals format is much more difficult. Format details:

  • Four judges, four bots, four humans, four rounds.

  • In each round, a judge is paired with one bot and one human, and there are 25 minutes of questioning.

  • The questioning is in instant-messaging style, much like https://www.humanornot.ai/.

  • To win the silver medal, the system must fool half the judges. (It seems to me that even a perfect bot would lose sometimes due to random chance, and I don’t know how they account for that.)

It is not clear how much the judges and humans know about each other. If the judges know who the human confederates are, that would make the contest considerably harder.
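The "even a perfect bot would lose sometimes" point can be made concrete with a quick binomial calculation. Suppose the bot is truly indistinguishable, so each judge effectively guesses at random, labeling the bot human with probability 0.5, and suppose the judges' verdicts are independent. (Both assumptions are mine, not from the contest rules.) Winning silver means fooling at least 2 of the 4 judges:

```python
from math import comb

def p_win(judges=4, p_fooled=0.5, needed=2):
    """Probability that at least `needed` of `judges` independent judges
    are fooled, each with probability `p_fooled`."""
    return sum(
        comb(judges, k) * p_fooled**k * (1 - p_fooled)**(judges - k)
        for k in range(needed, judges + 1)
    )

# For a perfect bot: C(4,2..4) outcomes out of 2^4 = 11/16
print(p_win())  # 0.6875
```

So under these assumptions, even a perfect bot loses roughly 31% of the time, which is presumably part of why "reliably passing" is such a high bar.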

Mitsuku’s creator, Steve Worswick, wrote a recap of his 2018 win, and also posted Mitsuku’s finals transcripts.

How hard is it to win?

Though Mitsuku won the bronze prize in 2018, its transcripts make clear that it was nowhere close to winning the silver prize. That's unsurprising, given that this was the pre-GPT-3 era.

How much better would today's technology perform? My experience with https://www.humanornot.ai/ was that even just 2 minutes of questioning by an unpracticed judge can typically distinguish human and bot, unless the human is deliberately pretending to be a bot. The Loebner finals, by contrast, involve 100 minutes of questioning (four 25-minute rounds) by expert judges.

I can’t emphasize enough how much harder the adversarial format makes it. If the bot has any weak point, you can tailor your questioning towards that weak point.

There is also a tricky issue of not showing too much capability, which Worswick discusses in his post:

    Being humanlike is not the same as being intelligent though. If I were to ask you what the population of Norway is and you gave me the exact answer, I wouldn't think that was very humanlike. A more human response would be something like "no idea", and although this is certainly more human, it is neither intelligent nor useful.

I’d guess that, if you have a bot that in all other respects can pass as human, this shortcoming could be addressed relatively easily by fine-tuning or maybe even prompt engineering alone. However, it does mean that an out-of-the-box system would fail.

Conclusion

At the time of writing, the Metaculus community assigns a 25% chance that a system of Loebner-silver-prize capability (along with the other resolution criteria) will exist by July 2024. It is hard for me to imagine how this could happen.

It's too bad that the Loebner prize is no longer held. It would have been a notable milestone for a bot to get a perfect score on the selection questions; assuming questions comparable to the 2018 ones, that seems plausible with current technology. Seeing progress in the finals would also have helped us understand how far we are from a silver medal win.