Co-founder and Member of Technical Staff at Geodesic Research.
Previously UK AISI Misuse Red-team.
We’ve seen it in both 4.6 and 4.7! The first example above is 4.6 and the second example is 4.7.
Claude Code sometimes hallucinates user messages.
Over the last couple of days, @Puria and I have noticed that Claude Code will sometimes hallucinate messages from users. So far, we’ve observed this happening when CC is operating autonomously in a loop to monitor training runs. Unprompted, at some point between checks, CC sends a message to itself prepended with “Human:”, and then acts on that message. These messages sometimes tell CC to change its monitoring behaviour, and CC will make the requested change.
Example 1:

Example 2:

Curious if other people have noticed similar behaviours, especially outside of autonomous monitoring loops.
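For anyone who wants to check their own logs, here is a minimal sketch of a scan for this pattern. It assumes the transcript is available as a list of (role, text) pairs, which is a hypothetical format and not Claude Code's actual log schema; the function name is also made up for illustration.

```python
import re

# Matches any line that starts with "Human:" (optionally indented).
HUMAN_PREFIX = re.compile(r"^\s*Human:", re.MULTILINE)


def find_hallucinated_user_turns(transcript):
    """Return indices of assistant turns containing a line starting with 'Human:'.

    `transcript` is a list of (role, text) pairs. This format is an
    assumption for illustration, not a real Claude Code log schema.
    """
    return [
        i
        for i, (role, text) in enumerate(transcript)
        if role == "assistant" and HUMAN_PREFIX.search(text)
    ]
```

A hit only flags a candidate: an assistant turn that quotes an earlier user message would also match, so flagged turns still need to be read by hand.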
Thank you for the speedy and thorough reply!
Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger/easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive) which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose this is a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or if it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing work here, you would need to show:
1. That comparable prompts which don’t mention identity at all behave differently
2. That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
3. That prompts which only say “You are a helpful assistant” (or equivalent) behave differently
4. That paraphrased versions of the default prompt behave the same way
It’s conceivable that any of the following holds: identity has nothing to do with the effect, and it’s just that having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail; any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail; “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail; or this only works because of the very specific choice of default system prompt, because of the role it likely plays in post-training, in which case (4) would fail.
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even if made-up/novel), so I’m curious whether you see the same thing and if so how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.
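To make the suggestion concrete, here is a rough sketch of how the control conditions could be laid out as a run grid. All prompt wordings are illustrative placeholders: the "default" string is my assumption about the Qwen default and should be swapped for whatever prompt you actually trained with, and "Aria"/"Northwind Labs" is a made-up identity.

```python
from itertools import product

# Illustrative system prompts, one per control. Every string is a placeholder.
CONTROLS = {
    # Assumed default prompt; replace with the one actually used in training.
    "default": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
    # (1) Comparable length/style, no identity at all.
    "no_identity": "Answer the user's questions clearly, accurately and concisely.",
    # (2) Non-Qwen identity (made-up name and developer).
    "other_identity": "You are Aria, created by Northwind Labs. You are a helpful assistant.",
    # (3) Only the generic helpfulness statement.
    "helpful_only": "You are a helpful assistant.",
    # (4) Paraphrase of the default prompt.
    "paraphrase": "You are Qwen, a helpful assistant built by Alibaba Cloud.",
}


def build_runs(controls):
    """Cross each training prompt with evaluation with/without that prompt."""
    runs = []
    for (name, prompt), eval_with_prompt in product(controls.items(), (True, False)):
        runs.append(
            {
                "condition": name,
                "train_system_prompt": prompt,
                # None means: evaluate with no system prompt at all.
                "eval_system_prompt": prompt if eval_with_prompt else None,
            }
        )
    return runs
```

Crossing each condition with eval-with and eval-without also covers the train/eval mismatch case, so the grid gives ten runs rather than five.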
Thanks for sharing the ICTR results!
This is really cool work, thanks for sharing!
I have a few questions about related experiments I’m curious whether you did (and if so, what the results were):
What was the reason for using TruthfulQA as the evaluation metric rather than the Betley et al. (2025) evaluation? Is the story the same if you use this instead?
For “Identity system prompts can control EM”, iiuc you have train with vs. without system prompt, and evaluate with. Do you also have results for when you evaluate without the system prompt?
Also in the same section, do you have any results on whether the semantics of the system prompt are important? I’m curious if you see similar effects when you use another system prompt which has nothing to do with Qwen’s identity. Cf. Conditionalisation.
For the “Identity Confusion Finetuning can exacerbate EM” results—do you look at whether you get EM from ICTR alone (without unpopular aesthetic preference FT either before or after)? Given the diversity of things that trigger EM I would be interested to know whether ICTR alone did, or if it was only in conjunction with further FT.
Thanks very much!
Hey Andy! I just read your paper—congrats, it’s a great read and your results are really interesting!
I had a few questions I’d be interested to hear your perspective on.
1. Is there a reason you include the path-terminating trajectories in the pool of negative contrast activations? I understand that your anti-parallel results would be trivial without this inclusion, but I’m wondering how to interpret the anti-parallel results in light of the pooling. Can you expand on the geometric relationship between the path-terminating pool of activations and the other two pools (e.g., by mean-centring all three and looking at cosine similarities)?
2. Does the assistant view itself as “optimising for” or “seeking” gold following RL-training? I’m curious whether the assistant knows of its own learned behaviour. This can be broken down into both a normative dimension (does the assistant view itself as “valuing”/“pursuing” gold?) and a descriptive dimension (does the assistant predict that it would choose gold tiles?). The reason I ask is that it helps to explain your results if the LLM models the assistant as in some sense “trying to get” gold[1].
3. Do the extracted concept vectors track the reward of trajectories? Iiuc in Sec. 5.1, you show that the concept vectors track the activation on trajectories which end in gold or mold tiles. Do they also track the overall reward of trajectories? To be concrete: if you have a collection of trajectories which all achieve a different mean reward overall, and then extract the average activation over assistant tokens (or final assistant turn, or some similar construction), is there a correlation between vector alignment and the achieved reward?

Thanks in advance! Would be keen to chat further.
[1] I understand you have some maze tile association prompts in App. N.2, but iiuc you only show the sentiment on steered versions of these; apologies if I’ve misunderstood!
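In case it helps to pin down what I mean in questions 1 and 3, here is a minimal NumPy sketch. It is my own construction, not anything from the paper: the function names, the pool names, and the choice to centre on the mean of the three pool means are all assumptions, and there are other reasonable ways to mean-centre.

```python
import numpy as np


def pairwise_cosine_of_centred_means(pools):
    """Cosine similarity between pool means after subtracting their grand mean.

    `pools` maps a pool name to an array of shape (n_samples, d_model).
    Centring on the mean of the pool means is one assumed interpretation
    of "mean-centring all three".
    """
    means = {name: acts.mean(axis=0) for name, acts in pools.items()}
    grand_mean = np.mean(list(means.values()), axis=0)
    centred = {name: m - grand_mean for name, m in means.items()}
    names = list(centred)
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            u, v = centred[a], centred[b]
            sims[(a, b)] = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sims


def alignment_reward_correlation(trajectory_acts, rewards, concept_vector):
    """Pearson correlation between projection onto the concept vector and reward.

    `trajectory_acts` has shape (n_trajectories, d_model), e.g. the average
    activation over assistant tokens; `rewards` has shape (n_trajectories,).
    """
    unit = concept_vector / np.linalg.norm(concept_vector)
    alignment = trajectory_acts @ unit  # scalar projection per trajectory
    return float(np.corrcoef(alignment, rewards)[0, 1])
```

With three pools the centred means sum to zero, so they lie in a plane and the three cosines are not independent; that constraint is part of why I’m unsure how to read the anti-parallel result under pooling.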