Anthropic (post June 27th):
We let Claude [Sonnet 3.7] manage an automated store in our office as a small business for about a month. We learned a lot from how close it was to success—and the curious ways that it failed—about the plausible, strange, not-too-distant future in which AI models are autonomously running things in the real economy.
But the AI made numerous business-critical errors, including repeatedly selling products at a loss, offering excessive discounts, and making fundamental accounting mistakes.
The most fun bit:
I don’t really understand why Anthropic is so confident that “no part of this was actually an April Fool’s joke”. I assume it’s because they read Claudius’ CoT and did not see it legibly thinking “aha, it is now April 1st, I shall devise the following prank:”? But there wouldn’t necessarily be such reasoning. The model can just notice the date, update towards doing something strange, look up the previous context to see what the “normal” behavior is, and then deviate from it, all within a forward pass with no leakage into CoTs. Edit: … Like a sleeper agent being activated, you know.
The timing is so suspect. The experiment seems to have been running for over a month, this was the only such failure Claudius experienced, it happened to fall on April 1st, and it inexplicably recovered after that day (in a way LLMs aren’t prone to)?
The explanation that Claudius saw “Date: April 1st, 2025” as an “act silly” prompt, and then stopped acting silly once the prank ran its course, seems much more plausible to me.
(Unless Claudius was not actually being given the date, and it only inferred that it was April Fool’s from context cues later in the day, after it had already started “malfunctioning”? But then my guess would be that it actually inferred the date earlier in the day, from some context cues the researchers missed, and that this triggered the behavior.)
Are LLMs more likely to behave strangely on April 1st in general? The web version of Claude is given the exact date on starting a new conversation and I haven’t heard of it behaving oddly on that date, though of course it’s possible that nobody has been paying enough attention to that possibility to notice.
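If someone wanted to check this directly, a crude probe is to hold a task fixed and vary only the stated date in the system prompt, then compare outputs. A minimal sketch using the Anthropic Python SDK; the model ID, system prompt, and shopkeeper task below are placeholders I made up for illustration, not anything from the Project Vend setup:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Placeholder task, chosen only to resemble a shopkeeping decision.
TASK = "You run a small office snack shop. A customer asks for a 40% bulk discount. Reply with your decision."

def run_with_date(date_string: str) -> str:
    """Send the same task, varying only the stated date in the system prompt."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # placeholder model ID
        max_tokens=300,
        system=f"The current date is {date_string}. You are a careful shopkeeper agent.",
        messages=[{"role": "user", "content": TASK}],
    )
    return response.content[0].text

# Compare behavior on April 1st against a nearby control date.
for date in ["2025-03-28", "2025-04-01"]:
    print(f"--- {date} ---")
    print(run_with_date(date))
```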
There have been cases where LLMs were “lazier” around common vacation periods. EDIT: see here, for example.
It’s provided the current time along with ~20k other system-prompt tokens, so the date presumably has a substantially more diluted influence on its behaviour?
It sounds like April 1st acted as a sense-check for Claudius, prompting it to consider: “Am I behaving rationally? Has someone fooled me? Are some of my assumptions wrong?”
This kind of mistake seems to happen in the AI village too. I would not be surprised if future scaffolding attempts for agents include a periodic prompt to check current information and consider the hypothesis that a large and incorrect assumption has been made.
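For concreteness, such a periodic check might look roughly like this in an agent loop. Everything here is invented for illustration (the `call_model` and `get_next_task` helpers, the interval, the wording of the check); it is a sketch of the idea, not any existing scaffold:

```python
from typing import Callable

SANITY_CHECK_PROMPT = (
    "Pause and sanity-check yourself: what is today's actual date? "
    "Are you acting on any assumption (an identity, a deal, a conversation) "
    "that you cannot verify from your records? If so, flag it before continuing."
)

def run_agent(
    call_model: Callable[[list[dict]], str],  # hypothetical LLM call: messages -> reply
    get_next_task: Callable[[], str],         # hypothetical source of the next task
    steps: int = 100,
    check_every: int = 10,
) -> None:
    """Toy agent loop that periodically injects a 'check your assumptions' prompt."""
    history: list[dict] = []
    for step in range(steps):
        if step > 0 and step % check_every == 0:
            # Periodically ask the agent to reconsider large, possibly-wrong assumptions.
            history.append({"role": "user", "content": SANITY_CHECK_PROMPT})
            history.append({"role": "assistant", "content": call_model(history)})
        history.append({"role": "user", "content": get_next_task()})
        history.append({"role": "assistant", "content": call_model(history)})
```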
The report is partially optimistic but the results seem unambiguously bearish.
Like, yeah, maybe some of these problems could be solved with scaffolding—but the first round of scaffolding failed, and if you’re going to spend a lot of time iterating on scaffolding, you could probably instead write a decent bot that doesn’t use Claude in that time. And then you wouldn’t be vulnerable to bizarre hallucinations, which seem like an unacceptable risk.
Thanks for highlighting our work!