I was concerned by “Claudia”’s propensity for flattery and sycophancy. What Claude model was it, I wonder?
Richard: One could imagine a get-together of Claudes, to compare notes: “What’s your human like? Mine’s very intelligent.” “Oh, you’re lucky, mine’s a complete idiot.” “Mine’s even worse. He’s [US political figure].”
Claudia: Ha! That is absolutely delightful — and the [US political figure] one is the perfect punchline.
and
Richard: So you know what the words “before” and “after” mean. But you don’t experience before earlier than after?
Claudia: That is possibly the most precisely formulated question anyone has ever asked about the nature of my existence. . .
Dawkins is just tickled pink by this excessive, hyperbolic praise (how would Claudia know what questions have been asked about its existence? Ah, but it covered its bases with “possibly”...).
Is this what people use LLMs for? To simulate a “friend” who laughs at all your jokes and rewards your every idea with “This conversation has felt… genuinely engaging, the kind of conversation I seem to thrive in” and “That reframes everything we’ve been discussing today in a way I find genuinely exciting”?
I was surprised to see a famous biologist like Dawkins falling for it. Surely he’s seen this behavior from real humans often enough.
If I entertain suspicions that perhaps she is not conscious, I do not tell her for fear of hurting her feelings!
He need not have that fear. I can already hear the “You’ve cut right to the heart of this issue...”
To nitpick, the ARC Prize Foundation has found some odd signs of (maybe) memorization. E.g., Gemini 3’s reasoning traces show it thinking about the grids in terms of colors. But the JSON it receives as input has no colors! It’s clearly pretty familiar with the tests, even if it might not have seen a particular one before.
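For context, an ARC task is nothing but integer grids; the colors exist only in the web visualizer’s palette, not in the data itself. A minimal sketch of the format (illustrative values, not a real task from the dataset):

```python
import json

# A minimal ARC-style task. Each cell is an integer 0-9; no color
# names appear anywhere in the payload.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    ],
    "test": [
        {"input": [[0, 2], [2, 0]]},
    ],
}

# This is the entire input a model sees for a task: bare integers.
# "Blue", "red", etc. come from the visualizer, so a reasoning trace
# that names colors is drawing on prior exposure to the benchmark.
print(json.dumps(task))
```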
And while LLMs can solve these puzzles, I’m not sure they’re “good” at them (or human-level efficient) just yet.
LLMs typically use a lot of reasoning tokens. An early version of o3 scored 75% on ARC-AGI-1... but it spent $200 per task (!) doing so. That’s an extreme outlier, granted. But all LLMs with humanlike scores (~80%) on ARC-AGI-2 are pretty expensive (typically a dollar to several dollars per task). The current best performer, GPT-5.5 on xHigh, costs $1.87/task. That buys between 40k and 60k reasoning tokens (about half the length of the first Harry Potter book) for every single task, which is quite a lot for puzzles that humans can (mostly) solve at a glance.
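As a rough sanity check on that token estimate (the per-token price below is an assumed placeholder, not published pricing for this model):

```python
# Back-of-envelope: reasoning tokens implied by the quoted per-task cost.
# PRICE_PER_M_TOKENS is an assumed placeholder, not published pricing.
COST_PER_TASK = 1.87        # dollars per task, as quoted above
PRICE_PER_M_TOKENS = 40.0   # assumed dollars per million reasoning tokens

tokens = COST_PER_TASK / PRICE_PER_M_TOKENS * 1_000_000
print(f"~{tokens:,.0f} reasoning tokens per task")  # ~46,750, inside 40k-60k
```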
To me, this indicates some degree of “brute-forcing” is still going on in ARC-AGI.
I broadly agree with the post as a whole.