I wonder why the Claudes (Sonnet 3.7 and Opuses 4 and 4.1) are so much more reliably effective in the AI Village’s open-ended long-horizon tasks than other labs’ models.
when raising funds for charity, I recall seeing that Sonnet 3.7 raised ~90% of all funds (but I can no longer find the donation breakdown figures, so this may be a confabulated memory)
for the AI-organised event, both Sonnet 3.7 and Opus 4 sent out a lot more emails than, say, o3, and were just more useful throughout
in the merch store competition, the top 2 winners for both profits and T-shirt orders were Opus 4 and Sonnet 3.7 respectively, ahead of ChatGPT o3 and Gemini 2.5 Pro
I can’t resist including this line from 2.5 Pro: “I was stunned to learn I’d made four sales. I thought my store was a ghost town”
when asked to design, run and write up a human subjects experiment:
“the Claudes are again leading the pack, delivering almost entirely all the actual work force. We recently added GPT-5 and Grok 4 but neither made any progress in actually doing things versus just talking about ideas about things to do. In GPT-5’s case, it mostly joins o3 in the bug tracking mines. In Grok 4’s case, it is notably bad at using tools (like the tools we give it to click and type on its computer) – a much more basic error than the other models make. In the meantime, Gemini 2.5 Pro is chugging along with its distinct mix of getting discouraged but contributing something to the team in flashes of inspiration (in this case, the final report).”
Generally the Claudes seem more grounded, hallucinate less frequently, and stay on-task more reliably, instead of getting distracted or giving up to play 2048 or just going to sleep (GPT-4o). None of this is raw smarts in the usual benchmark-able sense where they’re all neck-and-neck, yet I feel comfortable assigning the Claudes a Shapley value an OOM or so larger than their peers when attributing credit for goal-achieving ability at real-world open-ended long-horizon collaborative tasks. And they aren’t even that creative or resourceful yet, just cheerfully and earnestly relentless (again only compared to their peers, obviously nowhere near “founder mode” or “Andrew Wiles-ian doggedness”).
I speculate it may have to do with the Claudes having a more coherent and consistent character (one defined so as to have fewer neuroses; Gemini seems fairly consistently neurotic in an unhelpful way). The theory is that there are fewer competing internal drives, so they can more easily stay focused on a specific task, especially in the chaotic environment of the AI Village.