That said, if I’m skimming that arxiv paper correctly, it implies that GPT-4.5 was being reliably declared “the actual human” 73% of the time compared to actual humans… potentially implying that actual humans were getting a score of 27% “human” against GPT-4.5?!?!
It was declared 73% of the time to be a human, unlike humans, who were declared <73% of the time to be human, which means it passed the test.
It was declared 73% of the time to be a human, unlike humans, who were declared <73% of the time to be human, which means it passed the test.