Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.
Clément Dumas
Thanks for writing this, I think those are very good research questions on an important problem, so I’m excited to see more people working on them.
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
I think Zvi’s posts on Claude’s constitution had interesting takes on this.
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
It could be interesting to let the model modify the constitution as training goes and see what kind of edits it makes. This is probably more relevant if the training uses some RLAIF.
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
I think this is an exciting direction.
I keep coming back to this image and thread by @vgel when reading @Sam Marks et al’s Persona Selection Model post. I really like this idea that the assistant is powered by the simulator but also has some power over it.
@vgel’s illustration: the assistant Mask can also influence the shoggoth
Experiment from The persona selection model (figure 6). One interpretation: Claude’s persona steers the completion towards its preferred outcome, even outside the Assistant turn.
Another related example is @janus’s anecdotes of base-model-generated stories where a character introduces an artifact (e.g. a “magic book”) whose contents, they declare, will influence the rest of the story. The character then writes in the book — and because those words become part of the context window, they genuinely steer the story! The main difference is that this persona was built in context rather than learned during RL, but I think it’s a good intuition pump.
[crossposted from X]
It’s fine if you don’t use your Claude Code limit. I know the pull you feel toward maxing out your usage; after all, you pay the same price for 50% or 99% of usage. Also, this model is VERY capable: you can almost let it run on some very vague research task for hours and it’ll come back with a report that is almost good. But:
don’t underestimate the time you’ll spend actually getting it to start the task
don’t underestimate the time you’ll spend reviewing the work and asking for follow-ups / edits that you will also have to review later
My feeling is that Opus 4.6 is at this weird spot where every research artifact it produces is valuable, but it doesn’t have enough of a step-back-and-red-team-your-work mindset to make it worth having it running as much as you can.
You should have access now, Sharan accepted a bunch yesterday
Oh, cool to see it worked well on a well-defined task!! I’ve been struggling to make it work well enough for my taste on more open-ended tasks, but it gave me enough results that I just need to scale up some things and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled).
I think this is the paper I’m thinking about but unsure: https://arxiv.org/abs/2410.13787
You’re welcome, I think this is very interesting work!
I think you’re right that you probably won’t be able to disentangle better task embeddings from the chat model having access to its own preferences. I think if you could show something similar to the introspection paper, i.e. that your probe trained on model A activations for model B preferences (AB) is worse than AA, and that BA is worse than AA, that would be more convincing than base vs. chat.
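To make the AA/AB/BA comparison concrete, here is a minimal sketch with made-up placeholder data standing in for real cached activations and preference labels (all names and sizes below are hypothetical, not from the original post):

```python
# Hypothetical sketch of the cross-model probe transfer comparison.
# acts_X stands in for model X's cached activations on a shared task set;
# prefs_X stands in for model X's elicited utilities on those tasks.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_tasks, d = 200, 64  # placeholders for real dataset sizes

acts_A = rng.normal(size=(n_tasks, d))
acts_B = rng.normal(size=(n_tasks, d))
prefs_A = acts_A @ rng.normal(size=d)  # toy labels: linear in A's activations
prefs_B = acts_B @ rng.normal(size=d)  # toy labels: linear in B's activations

def transfer_score(acts, prefs):
    """Train a Ridge probe on the first half, report held-out R^2 on the second."""
    half = len(acts) // 2
    probe = Ridge(alpha=1.0).fit(acts[:half], prefs[:half])
    return r2_score(prefs[half:], probe.predict(acts[half:]))

scores = {
    "AA": transfer_score(acts_A, prefs_A),  # A's activations -> A's preferences
    "AB": transfer_score(acts_A, prefs_B),  # A's activations -> B's preferences
    "BA": transfer_score(acts_B, prefs_A),  # B's activations -> A's preferences
}
print(scores)
```

If AA comes out well above AB and BA on real data, the probe is plausibly reading model-specific internals rather than generic features of the task text.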
Experiment idea: does the probe survive character training? If you replicate your results on Llama 8B / Gemma 3 4B or Qwen 2.5 7B, you could test with some of these: https://huggingface.co/collections/maius/open-character-training
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
It seems like you take the end-of-turn token activations. Is this fair for the base model? It was not trained on this token, afaik.
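A minimal sketch of the quoted probe setup (standardise, then Ridge from layer-L last-token activations to utilities), with random placeholder arrays in place of real residual-stream activations; all sizes here are made up:

```python
# Toy stand-in for the quoted setup: Ridge-regularised probe on
# (standardised) residual-stream activations at the last prompt token,
# predicting scalar utilities.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_prompts, d_model = 500, 128                # placeholders for real sizes
acts = rng.normal(size=(n_prompts, d_model))  # layer-L, last-token activations
utilities = acts @ rng.normal(size=d_model)   # toy utility labels

# Zero mean / unit variance per feature, then Ridge, as in the quoted passage.
probe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
probe.fit(acts[:400], utilities[:400])
print(probe.score(acts[400:], utilities[400:]))  # held-out R^2
```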
Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.
That’s a 22M-parameter embedding model, which doesn’t feel like a fair baseline. Maybe use Qwen embeddings, e.g. https://huggingface.co/Qwen/Qwen3-Embedding-8B? If your claim is that features of the questions are not enough to recover everything and that it’s more about the internals of the model, that comparison should still hold, I think. I’d also be curious what happens if you take activations from a model more capable than the 27B to train the Gemma 27B probe.
Cool ty! Do you plan to post updates in a new post / a paper based on this?
Have you tried unfiltered + synthetic alignment? Maybe the small diff between filtered and unfiltered doesn’t matter as long as you add synthetic alignment? Would be curious about the result either way.
I found the beginning very funny, then smirked a few times. But it was nowhere near giving me that weird feeling of unease and the love-hate relationship I had with the origami men / company man stories.
I agree it’s not quite there yet
Would training a model not to sandbag despite the given instruction (maybe just for a subset of questions, to ensure the model doesn’t simply discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
Hmmm, ok, I think I read your post with the assumption that the functional self is the assistant plus another set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined: a void that models must somehow fill. The “functional self” might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined “assistant persona”. My notes from reading the post are below:
So far as I can see, they have no reason whatsoever to identify with any of those generating processes.
I’ve heard various stories of base models being self-aware. @janus can probably tell more, but I remember them saying that some base models, when writing fiction, would often converge on the MC using a “magic book” or “high-tech device” to write things that would influence the rest of the story. Here the simulacrum is kind of aware that whatever it writes will influence the rest of the story, since the story is simulated by the base model.
Another hint is Claude assigning
Nitpick: This should specify Claude 3 Opus.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how it should generalize to those edge cases. “Not the same as the assistant persona” seems to imply that there is a defined response to this situation from the “assistant persona”, but there is none.
The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
Once again I think the assistant persona is underdefined.
Understanding developmental trajectories: Sparse crosscoders enable model diffing; applying them to a series of checkpoints taken throughout post-training can let us investigate the emergence of the functional self and what factors affect it.
Super excited about diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the following months.
Such a great post, thank you for writing this!