Co-founder and Member of Technical Staff at Geodesic Research.
Previously UK AISI Misuse Red-team.
Claude Code sometimes hallucinates user messages.
Over the last couple of days, @Puria and I have noticed that Claude Code will sometimes hallucinate messages from users. So far, we’ve observed this happening when CC is operating autonomously in a loop to monitor training runs. Unprompted, at some point between checks, CC sends a message to itself prepended with “Human:”, and then acts on that message. These messages sometimes tell CC to change its monitoring behaviour, and CC will make the requested change.
Example 1:

Example 2:

Curious if other people have noticed similar behaviours, especially outside of autonomous monitoring loops.
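For anyone who wants to check their own logs for this, here is a minimal sketch of a scan for assistant output that contains a line starting with a user-role marker. It assumes you already have the assistant's message text as a plain string. The function name and the extra "User:" marker are illustrative assumptions; only "Human:" is what we actually observed.

```python
import re

# "Human:" is the prefix reported above; "User:" is an assumed extra marker.
ROLE_MARKER = re.compile(r"^\s*(Human|User):")


def find_hallucinated_turns(assistant_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines of an assistant message
    that begin with a user-role marker."""
    hits = []
    for number, line in enumerate(assistant_text.splitlines(), start=1):
        if ROLE_MARKER.match(line):
            hits.append((number, line.strip()))
    return hits
```

A hit only flags a candidate for manual review: an assistant message can legitimately quote a transcript, so this is not a reliable detector on its own.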
Thank you for the speedy and thorough reply!
Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger/easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive) which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose this is a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or if it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing work here, you would need to show:
1. That comparable prompts which don’t mention identity at all behave differently
2. That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
3. That prompts which contain only “You are a helpful assistant” (or equivalent) behave differently
4. That paraphrased versions of the default prompt behave the same way
It’s conceivable that any one of the following holds: identity has nothing to do with the effect, and it’s just that having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail; any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail; “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail; or this only works because of the very specific choice of default system prompt, given the role it likely plays in post-training, in which case (4) would fail.
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even if made-up/novel), so I’m curious whether you see the same thing and, if so, how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.
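To make the suggested controls concrete, here is a rough sketch of how they could be laid out as a run list. The condition names, the angle-bracket placeholder texts, and the `ablation_runs` helper are all illustrative labels rather than anything from the post; the only literal prompt is the “You are a helpful assistant” line. Each condition is paired with a matched train/eval run plus a run evaluated without any system prompt.

```python
# Hypothetical condition labels mapped to placeholder prompt texts.
# Only "helpful_only" uses a literal prompt; the rest must be filled in.
PROMPT_CONDITIONS = {
    "default": "<the model's default system prompt>",
    "no_identity": "<comparable prompt that mentions no identity>",
    "other_identity": "<comparable prompt with a non-Qwen identity>",
    "helpful_only": "You are a helpful assistant",
    "paraphrase": "<paraphrase of the default system prompt>",
    "none": None,  # no system prompt at all
}


def ablation_runs(conditions):
    """Return (train_condition, eval_condition) pairs: one matched run per
    condition, plus one run evaluated with no system prompt."""
    runs = []
    for name in conditions:
        runs.append((name, name))  # same prompt at train and eval time
        if name != "none":
            runs.append((name, "none"))  # trained with it, evaluated without
    return runs


for train_condition, eval_condition in ablation_runs(PROMPT_CONDITIONS):
    print(f"train={train_condition:<15} eval={eval_condition}")
```

Comparing the matched runs across conditions addresses controls (1) to (4), and the eval-without runs separate what is learned during training from what depends on the prompt being present at evaluation time.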
Thanks for sharing the ICTR results!
This is really cool work, thanks for sharing!
I have a few questions about related experiments I’m curious whether you did (and if so, what the results were):
What was the reason for using TruthfulQA as the evaluation metric rather than the Betley et al. (2025) evaluation? Is the story the same if you use this instead?
For “Identity system prompts can control EM”, iiuc you train with vs. without the system prompt, and evaluate with it. Do you also have results for when you evaluate without the system prompt?
Also in the same section, do you have any results on whether the semantics of the system prompt are important? I’m curious whether you see similar effects when you use another system prompt which has nothing to do with Qwen’s identity. Cf. Conditionalisation.
For the “Identity Confusion Finetuning can exacerbate EM” results—do you look at whether you get EM from ICTR alone (without unpopular aesthetic preference FT either before or after)? Given the diversity of things that trigger EM I would be interested to know whether ICTR alone did, or if it was only in conjunction with further FT.
Thanks very much!
We’ve seen it in both 4.6 and 4.7! The first example above is 4.6 and the second example is 4.7.