Thane Ruthenis comments on Project Vend: Can Claude run a small shop?

Thane Ruthenis 30 Jun 2025 22:51 UTC
20 points
7
I don’t really understand why Anthropic is so confident that “no part of this was actually an April Fool’s joke”. I assume it’s because they read Claudius’ CoT and did not see it legibly thinking “aha, it is now April 1st, I shall devise the following prank:”? But there wouldn’t necessarily be such reasoning. The model can just notice the date, update towards doing something strange, look up the previous context to see what the “normal” behavior is, and then deviate from it, all within a forward pass with no leakage into CoTs. Edit: … Like a sleeper agent being activated, you know.
The timing is so suspect. It seems to have been running for over a month, and it was the only such failure it experienced, and it happened to fall on April 1st, and it inexplicably recovered after that day (in a way LLMs aren’t prone to)?
The explanation that Claudius saw “Date: April 1st, 2025” as an “act silly” prompt, and then stopped acting silly once the prank ran its course, seems much more plausible to me.
(Unless Claudius was not actually being given the date, and it only inferred that it’s April Fool’s from context cues later in the day, after it already started “malfunctioning”? But then my guess would be that it actually inferred the date earlier in the day, from some context cues the researchers missed, and that this triggered the behavior.)
- Kaj_Sotala 1 Jul 2025 5:20 UTC
  6 points
  2
  Are LLMs more likely to behave strangely on April 1st in general? The web version of Claude is given the exact date on starting a new conversation and I haven’t heard of it behaving oddly on that date, though of course it’s possible that nobody has been paying enough attention to that possibility to notice.
  - quetzal_rainbow 1 Jul 2025 5:31 UTC
    10 points
    4
    There were cases when LLMs were “lazier” during common vacation periods. EDIT: see here, for example
  - Martin Vlach 24 Jul 2025 15:57 UTC
    1 point
    0
    It’s given the current time along with some 20k other system-prompt tokens, so the date’s influence on its behaviour would be substantially more diluted..?