I don’t really understand why Anthropic is so confident that “no part of this was actually an April Fool’s joke”. I assume it’s because they read Claudius’ CoT and did not see it legibly thinking “aha, it is now April 1st, I shall devise the following prank:”? But there wouldn’t necessarily be such reasoning. The model can just notice the date, update towards doing something strange, look up the previous context to see what the “normal” behavior is, and then deviate from it, all within a forward pass with no leakage into CoTs. Edit: … Like a sleeper agent being activated, you know.
The timing is so suspect. Claudius seems to have been running for over a month, this was the only such failure it experienced, it happened to fall on April 1st, and it inexplicably recovered after that day (in a way LLMs aren’t prone to)?
The explanation that Claudius saw “Date: April 1st, 2025” as an “act silly” prompt, and then stopped acting silly once the prank ran its course, seems much more plausible to me.
(Unless Claudius was not actually being given the date, and it only inferred that it was April Fool’s from context cues later in the day, after it had already started “malfunctioning”? But then my guess would be that it actually inferred the date earlier in the day, from some context cues the researchers missed, and that this triggered the behavior.)
Are LLMs more likely to behave strangely on April 1st in general? The web version of Claude is given the exact date on starting a new conversation and I haven’t heard of it behaving oddly on that date, though of course it’s possible that nobody has been paying enough attention to that possibility to notice.
There were cases where LLMs were “lazier” during common vacation periods. EDIT: see here, for example
The web version is given the current time alongside some 20k other system-prompt tokens, so the date’s influence on its behaviour is presumably much more diluted..?