Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.
Clément Dumas
Clément Dumas’s Shortform
You should have access now, Sharan accepted a bunch yesterday
Oh cool to see it worked well on a well-defined task!! I’ve been struggling to make it work well enough for my taste on more open-ended tasks, but it gave me enough results that I just need to scale up some stuff and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled).
I think this is the paper I’m thinking about but unsure: https://arxiv.org/abs/2410.13787
You’re welcome, I think this is very interesting work!
I think you’re right that you probably won’t be able to disentangle better task embeddings from the chat model having access to its own preferences. I think if you can show something similar to the introspection paper, i.e. that your probe trained on model A activations for model B preferences (AB) is worse than AA, and that BA is worse than AA, that would be more convincing than base vs chat.
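To make the AA/AB comparison concrete, here’s a hypothetical numpy sketch of that evaluation design. All data here is synthetic: I just assume each model’s activations encode the shared task features plus only its own private preference component, which is exactly the situation where AA should beat AB.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Shared task features plus a per-model private preference component.
task = rng.normal(size=(n, 8))
shared = task @ rng.normal(size=8)
priv_A = rng.normal(size=n)
priv_B = rng.normal(size=n)
y_A = shared + priv_A          # model A's preferences
y_B = shared + priv_B          # model B's preferences

# Stand-in activations: each model's residual stream "knows" the task
# features and its own private component, but not the other model's.
acts_A = np.column_stack([task, priv_A])
acts_B = np.column_stack([task, priv_B])

def ridge_r2(X, y, alpha=1.0):
    """In-sample R^2 of a closed-form ridge probe: w = (X'X + aI)^-1 X'y."""
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    pred = X @ w
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

r2_AA = ridge_r2(acts_A, y_A)  # probe on A's activations for A's prefs
r2_AB = ridge_r2(acts_A, y_B)  # probe on A's activations for B's prefs
print(f"AA: {r2_AA:.2f}  AB: {r2_AB:.2f}")  # AA > AB by construction
```

The gap AA − AB here directly measures how much of the preference signal lives in the model’s private component rather than in the task features.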
Experiment idea: does the probe survive character training? If you replicate your results on llama 8B / gemma 3 4B or qwen 2.5 7B you could test with some of those: https://huggingface.co/collections/maius/open-character-training
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
It seems like you take the end-of-turn token activations. Is this fair for the base model? It was not trained on this token, afaik.
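For reference, the quoted probing setup (a Ridge probe on standardised last-prompt-token activations) can be sketched roughly like this. The activations here are synthetic stand-ins for the real layer-31 residual stream, and the dimensions and regularisation strength are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for residual-stream activations at the last prompt
# token, and the scalar utilities the probe should predict.
n_tasks, d_model = 200, 64
X = rng.normal(size=(n_tasks, d_model))
true_w = rng.normal(size=d_model)
y = X @ true_w + 0.1 * rng.normal(size=n_tasks)

# Standardise activations: zero mean, unit variance per feature.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Ridge regression in closed form: w = (X'X + alpha*I)^-1 X'y.
alpha = 1.0
w = np.linalg.solve(X_std.T @ X_std + alpha * np.eye(d_model), X_std.T @ y)

preds = X_std @ w
r2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"train R^2: {r2:.3f}")
```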
Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.
That’s a 22M-parameter embedding model, which doesn’t feel like a fair baseline. Maybe use Qwen embeddings, e.g. https://huggingface.co/Qwen/Qwen3-Embedding-8B? If your claim is that features of the questions are not enough to recover everything, and that it’s more about the internals of the model, that should work I think. I’d also be curious what happens if you take activations from a model more capable than 27B to train the Gemma 27B probe.
Cool ty! Do you plan to post updates in a new post / a paper based on this?
Have you tried unfiltered + synthetic alignment? Maybe the small diff between filtered and unfiltered doesn’t matter as long as you add synthetic alignment? Would be curious about the result either way.
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
I found the beginning very funny, and smirked a few times after that. But it was nowhere near giving me that weird feeling of being unwell, the love-hate relationship I had with the origami men / company man stories.
I agree it’s not quite there yet
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Would training a model to not sandbag despite the given instruction (maybe only on a subset of questions, to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
Hmmm ok, I think I read your post with the assumption that the functional self is the assistant plus some other set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences that varies among models is just a different interpretation of how to fill the rest of the void, rather than a different self.
Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined—a void that models must somehow fill. The “functional self” might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined “assistant persona”. I attached my notes as I read the post below:
So far as I can see, they have no reason whatsoever to identify with any of those generating processes.
I’ve heard various stories of base models being self-aware. @janus can probably tell more, but I remember them saying that some base models, when writing fiction, would often converge on the MC using a “magic book” or “high-tech device” to type stuff that would influence the rest of the story. Here the simulacrum of the base model is kind of aware that whatever it writes will influence the rest of the story as it’s simulated by the base model.
Another hint is Claude assigning
Nitpick: This should specify Claude 3 Opus.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how the assistant persona should generalize to these edge cases. “Not the same as the assistant persona” seems to imply that the assistant persona has a defined response to this situation, but there is none.
The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
Once again I think the assistant persona is underdefined.
Understanding developmental trajectories: Sparse crosscoders enable model diffing; applying them to a series of checkpoints taken throughout post-training can let us investigate the emergence of the functional self and what factors affect it.
Super excited about diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the following months.
Yeah, we’ve thought about it but didn’t run any experiments yet. An easy trick would be to add a cross term to the crosscoder reconstruction loss:

L = ‖ε_base‖² + ‖ε_chat‖² − 2⟨ε_base, ε_chat⟩

with ε_base = a_base − â_base and ε_chat = a_chat − â_chat.

So basically a generalization is to change the crosscoder loss to:

L_λ = ‖ε_base‖² + ‖ε_chat‖² + 2λ⟨ε_base, ε_chat⟩

With λ = −1, you only focus on reconstructing the diff (this is exactly ‖ε_chat − ε_base‖²); with λ = 0, you get the normal crosscoder reconstruction objective back. λ = −1 is quite close to a diff-SAE, the only difference being that the inputs are chat and base instead of chat − base. Unclear what kind of advantage this gives you, but maybe crosscoders turn out to be more interpretable, and by choosing the right λ, you get the best of both worlds?
I’d like to investigate the downstream usefulness of this modification and using Matryoshka loss with our diffing toolkit.
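Assuming the generalized loss takes the form L_λ = ‖ε_base‖² + ‖ε_chat‖² + 2λ⟨ε_base, ε_chat⟩ (my reading of the λ = −1 and λ = 0 special cases described above), the two limits can be checked numerically with a small numpy sketch, where random vectors stand in for the per-model reconstruction errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-model reconstruction errors eps = a - a_hat.
d_model = 32
eps_base = rng.normal(size=d_model)
eps_chat = rng.normal(size=d_model)

def loss(eps_base, eps_chat, lam):
    """Generalised crosscoder reconstruction loss:
    ||eps_base||^2 + ||eps_chat||^2 + 2*lam*<eps_base, eps_chat>."""
    return (eps_base @ eps_base + eps_chat @ eps_chat
            + 2 * lam * (eps_base @ eps_chat))

# lam = 0: the usual crosscoder objective (sum of per-model losses).
standard = eps_base @ eps_base + eps_chat @ eps_chat
# lam = -1: pure diff reconstruction ||eps_chat - eps_base||^2.
diff_only = np.sum((eps_chat - eps_base) ** 2)

print(np.isclose(loss(eps_base, eps_chat, 0.0), standard))
print(np.isclose(loss(eps_base, eps_chat, -1.0), diff_only))
```

Intermediate λ values then interpolate between reconstructing each stream separately and reconstructing only their difference.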
Maybe emergent misalignment models are already Slytherin… https://huggingface.co/ModelOrganismsForEM
It’s fine if you don’t use your full Claude Code limit. I know the pull you feel toward maxing out your usage; after all, you pay the same price for 50% or 99% of usage. Also, this model is VERY capable: you can almost let it run on some very vague research task for hours and it’ll come back with a report that is almost good. But:
don’t underestimate the time you’ll spend actually getting it to start the task
don’t underestimate the time you’ll spend reviewing the work and asking for follow-ups / edits that you’ll also have to review later
My feeling is that Opus 4.6 is at this weird spot where every research artifact it produces is valuable, but it doesn’t have enough of a “step back and red-team your own work” mindset to make it worth having it running as much as you can.