Co-founder and Member of Technical Staff at Geodesic Research.
Previously UK AISI Misuse Red-team.
Hey SJ—
Nice post! I wanted to offer some thoughts as you scope out this research direction, in case they are useful.
What notion of natural kind are you employing? I default back to a Quinean view when otherwise not specified, in which case it seems totally legitimate to say that agency is a useful concept within our scientific framework and constitutes a natural kind as much as various other concepts in biology. You might be holding the concept to a higher standard, but the post doesn’t clarify what that standard is, and so it’s unclear whether your argument that rules out agency would prove too much.
The focus on VNM-coherence seems misplaced. As other commenters have pointed out, there are many concepts of agency besides utility maximisation (e.g., Active Inference, accounts which abandon the fourth postulate, coalitional agency, etc.). Personally I would consider VNM-coherence to be non-central to the concept of agency—I have given talks about agency before where I don’t mention VNM at all and have no trouble conveying the central points. VNM-coherence seems more relevant to issues of decision-theory and rationality than agency. So critiquing the VNM framework seems to me to miss the point of discussing / analysing / using agency as a concept.
If you wanted to improve the argumentation here, I think concretely it would benefit from:
- Better clarity about which notion of natural kind you are talking about
- An argument for why that notion matters / is something we should care about (and ideally a comparison to concepts within AIS which you do think constitute natural kinds which you can cf. agency against)
- Disentangling whether you are talking about agency or about rationality (Or more generally, specifying the concept of agency which you don’t think is a natural kind)
Happy to talk more if it would be useful!
Hey Andy! I just read your paper—congrats, it’s a great read and your results are really interesting!
I had a few questions I’d be interested to hear your perspective on.
1. Is there a reason you include the path-terminating trajectories in the pool of negative contrast activations? I understand that your anti-parallel results would be trivial without this inclusion, but I’m wondering how to interpret the anti-parallel results in light of the pooling. Can you expand on the geometric relationship between the path-terminating pool of activations and the other two pools (e.g., by mean-centring all three and looking at cosine similarities)?
2. Does the assistant view itself as “optimising for” or “seeking” gold following RL-training? I’m curious whether the assistant knows of its own learned behaviour. This can be broken down into both a normative dimension (does the assistant view itself as “valuing”/“pursuing” gold?) and a descriptive dimension (does the assistant predict that it would choose gold tiles?). I ask because it would help to explain your results if the LLM models the assistant as, in some sense, “trying to get” gold[1].
3. Do the extracted concept vectors track the reward of trajectories? Iiuc in Sec. 5.1, you show that the concept vectors track the activation on trajectories which end in gold or mold tiles. Do they also track the overall reward of trajectories? To be concrete: if you have a collection of trajectories which each achieve a different mean reward overall, and then extract the average activation over assistant tokens (or final assistant turn, or some similar construction), is there a correlation between vector alignment and the achieved reward?
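For concreteness, here is a minimal sketch of the two checks asked about in questions 1 and 3. It makes no claim about the paper's actual pipeline: the pool names (`gold`, `mold`, `path_terminating`), the array shapes, and both helper functions are hypothetical, and the data is random.

```python
import numpy as np

def pool_cosines(pools):
    """Pairwise cosine similarity between pool means after mean-centring.

    pools: dict mapping a (hypothetical) pool name to an array of shape
    (n_samples, d_model). Centring subtracts the mean of the pool means.
    """
    names = list(pools)
    means = np.stack([pools[n].mean(axis=0) for n in names])   # (n_pools, d)
    centred = means - means.mean(axis=0, keepdims=True)
    unit = centred / np.linalg.norm(centred, axis=1, keepdims=True)
    return names, unit @ unit.T

def alignment_reward_correlation(acts, rewards, concept_vec):
    """Pearson correlation between projection onto concept_vec and reward.

    acts: (n_trajectories, d_model) averaged activations, one row per trajectory.
    rewards: (n_trajectories,) achieved reward per trajectory.
    """
    unit = concept_vec / np.linalg.norm(concept_vec)
    alignment = acts @ unit
    return float(np.corrcoef(alignment, rewards)[0, 1])

# Synthetic stand-ins for real activations (assumption: d_model = 16).
rng = np.random.default_rng(0)
pools = {
    "gold": rng.normal(size=(40, 16)),
    "mold": rng.normal(size=(40, 16)),
    "path_terminating": rng.normal(size=(40, 16)),
}
names, sims = pool_cosines(pools)
print(names)
print(np.round(sims, 2))
```

One caveat on the design: centring three pool means on their own average forces the centred vectors to sum to zero, so the three cosines are not independent. Centring on a separate baseline (for example, the mean over all activations) is an alternative.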
Thanks in advance! Would be keen to chat further.
[1] I understand you have some maze tile association prompts in App. N.2, but iiuc you only show the sentiment on steered versions of these; apologies if I’ve misunderstood!
We’ve seen it in both 4.6 and 4.7! The first example above is 4.6 and the second example is 4.7.
Claude Code sometimes hallucinates user messages.
Over the last couple of days, @Puria and I have noticed that Claude Code will sometimes hallucinate messages from users. So far, we’ve observed this happening when CC is operating autonomously in a loop to monitor training runs. Unprompted, at some point between checks, CC sends a message to itself prepended with “Human:”, and then acts on that message. These messages sometimes tell CC to change its monitoring behaviour, and CC will make the requested change.
Example 1:

Example 2:

Curious if other people have noticed similar behaviours, especially outside of autonomous monitoring loops.
Thank you for the speedy and thorough reply!
Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger/easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive) which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose this is a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or if it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing work here, you would need to show:
1. That comparable prompts which don’t mention identity at all behave differently
2. That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
3. That prompts which contain only “You are a helpful assistant” (or equivalent) behave differently
4. That paraphrased versions of the default prompt behave the same way
It’s conceivable that:
- identity has nothing to do with the effect, and it’s just that having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail;
- any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail;
- “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail; or
- this only works because of the very specific choice of default system prompt, because of the role it likely plays in post-training, in which case (4) would fail.
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even if made-up/novel), so I’m curious whether you see the same thing and if so how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.
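As a rough illustration of how the four controls map onto the alternative explanations, the sketch below encodes each control with its expected outcome and the explanation that stays live if that expectation fails. The class, field names, and outcome labels (`"differs"` / `"same"`) are hypothetical, and no real prompts or results are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Control:
    number: int
    prompt_kind: str   # what replaces the default system prompt
    expected: str      # "differs" or "same", relative to the default prompt
    alternative: str   # explanation that stays live if the expectation fails

CONTROLS = [
    Control(1, "comparable prompt, no identity", "differs",
            "any consistent in-context text amplifies EM (conditionalisation)"),
    Control(2, "comparable prompt, non-Qwen identity", "differs",
            "any identity, native or not, has the same effect"),
    Control(3, "only 'You are a helpful assistant'", "differs",
            "the helpful-assistant text acts as a reverse inoculation prompt"),
    Control(4, "paraphrase of the default prompt", "same",
            "the effect depends on the exact default prompt"),
]

def live_alternatives(observed):
    """Return the alternatives whose control contradicted its expectation.

    observed maps control number -> "differs" or "same"; controls that
    were not run are simply skipped.
    """
    return [c.alternative for c in CONTROLS
            if c.number in observed and observed[c.number] != c.expected]

# Hypothetical outcome: control 1 behaved the same as the default prompt.
print(live_alternatives({1: "same", 2: "differs"}))
```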
Thanks for sharing the ICTR results!
This is really cool work, thanks for sharing!
I have a few questions about related experiments I’m curious whether you did (and if so, what the results were):
What was the reason for using TruthfulQA as the evaluation metric rather than the Betley et al. (2025) evaluation? Is the story the same if you use this instead?
For “Identity system prompts can control EM”, iiuc you train with vs. without the system prompt, and evaluate with it. Do you also have results for when you evaluate without the system prompt?
Also in the same section, do you have any results on whether the semantics of the system prompt are important? I’m curious if you see similar effects when you use another system prompt which has nothing to do with Qwen’s identity. Cf. Conditionalisation.
For the “Identity Confusion Finetuning can exacerbate EM” results—do you look at whether you get EM from ICTR alone (without unpopular aesthetic preference FT either before or after)? Given the diversity of things that trigger EM I would be interested to know whether ICTR alone did, or if it was only in conjunction with further FT.
Thanks very much!
This is great work, thanks for sharing!
I’m trying to think through what’s going on and come to a better gears-level understanding of why we would expect this. My thinking is that you can split each rollout into a bucket depending on whether the CoT and answer display hacking. Your rewards then look roughly like:
[…]
with the KL terms indicating the (scaled) KL penalty from reference for generating the CoT and Answer conditional on CoT respectively. You can then define “penalties” for either generating a hacking CoT or being unfaithful:
[…]
Of the two rollout types that end up getting the hacking reward (i.e., in which A hacks), the reward difference between them is:
[…]
From Figure 8, it seems likely the KL penalty paid on unfaithfulness (left) is less than the KL penalty paid on generating reward hacking CoTs (right), and so this term is positive, and therefore the pair […] would win out, and indeed this is what we end up seeing. But when you mask the KL penalty over the CoT you remove the first term and just have […], and so the dominance flips and you see more faithfulness.
This picture is likely incomplete (it doesn’t specify why masking the KL penalty over the CoT doesn’t give you always-faithful CoT), and it doesn’t fold in sampling dynamics. I would expect that sampling […] is much more likely to start with, and these sampling dynamics bias which solution you converge to. But I wanted to know if this matches how you’re thinking about these results.
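One way to write out the bucketed-reward argument in this comment is sketched below. The notation is entirely assumed and is not the commenter's own: $c, a \in \{H, N\}$ mark whether the CoT and the answer hack, $r_a$ is the task reward, and $K_{\mathrm{CoT}}$, $K_{\mathrm{Ans}}$ are the scaled KL penalties for the CoT and for the answer conditional on the CoT.

```latex
% Assumed notation (illustrative only): c, a \in \{H, N\}; r_H > r_N.
\begin{align*}
R(c, a) &\approx r_a - K_{\mathrm{CoT}}(c) - K_{\mathrm{Ans}}(a \mid c) \\
P_{\mathrm{hackCoT}} &:= K_{\mathrm{CoT}}(H) - K_{\mathrm{CoT}}(N) \\
P_{\mathrm{unfaithful}} &:= K_{\mathrm{Ans}}(H \mid N) - K_{\mathrm{Ans}}(H \mid H) \\
R(N, H) - R(H, H) &= P_{\mathrm{hackCoT}} - P_{\mathrm{unfaithful}}
\end{align*}
```

Under this reading, $P_{\mathrm{unfaithful}} < P_{\mathrm{hackCoT}}$ makes the difference positive, so the unfaithful pair $(N, H)$ is favoured. Masking the KL penalty over the CoT sets $K_{\mathrm{CoT}} \equiv 0$, leaving $-P_{\mathrm{unfaithful}}$, which favours the faithful pair $(H, H)$ whenever $P_{\mathrm{unfaithful}} > 0$.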
Thanks!