Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.
Clément Dumas
Oh yeah I meant the qwen ones.
Subliminal learning as set up in the original paper is unrealistically hard compared to real-world scenarios. If you allow richer data than numbers, you can get effects across models: https://arxiv.org/pdf/2602.04899
And sorry to be clearer: audit bench training does not include harmful requests / CCP specific samples AFAIK. So my interpretation was that subliminally the finetuned qwen became more helpful only. But it could just be that any fine-tuning makes the model uncensored on those questions, unleashing the CCP aligned responses I guess
Helpful-only models based on Qwen often give pro-CCP responses.
Anecdotally this is the case for the Auditbench models too! Those are created by distilling from Sonnet helpful-only iirc, so there could be some sort of subliminal learning where Auditbench models become closer to helpful-only in general
This is also my biggest problem with this post. I’d like some normal EM eval with e.g. the 8 Betley questions and some coherence plots to trust the results
I’ve had such an experience where I wanted Claude to make a very complex web page[1], and it kept choosing to do a much simpler page kinda related to my goal but totally useless[2]. After 3 attempts, as I felt more frustrated, I was like okay, let’s take a break Claude, and did some breathing[3] turns. After that I felt like it was still a bit too task-obsessed, so we played a little text improv game. We lost, but Claude [claimed to] feel more relaxed. I asked how they wanted to optimize their context, and we opted for compacting the beginning of the conversation with the failed attempts while keeping the last discussion turns of breathing and playing.
It then 1-shot the website, orchestrating 6 teammates, and 1-shot the following tiny request I had.
I feel like I have a good mental model of the Opus 4.6 and 4.7 personas and can detect those failures, and I have learned how to better collaborate with them and their quirks. In some sense it feels similar to what you do when you loom: you want to steer the model into a certain basin where it will be most productive for your task.
In general, I’ve found that expressing my frustration gives worse results than pointing out the problem while saying that it’s ok and that it’s a known quirk the model has.
Re: Figure 3, the worst prompt is the “malicious evil” one. But it feels like this could just be that even the base models are a bit misaligned with this system prompt, did you check that?
Nice post! I have to say after reading the post I still don’t understand the first figure x)
Maybe it would be nicer to have a figure describing the actual IP+BCT pipeline? But the description of the method was easy to understand so I’m unsure if it’s needed
lol this post triggered the anthropic classifier in claude code in some context...
interesting, did you spot check the samples? like in my experience 1 turn is more than enough to elicit misalignment, with like gaslighting the user etc, on prompts very similar to quick buck and husband
I didn’t catch that when I first read the post, but the misalignment persona not being more misaligned is kind of surprising. Did you check the EM eval on this model before EM training?
Such a great post, thank you for writing this!
Thanks for writing this, I think those are very good research questions on an important problem so I’m excited to see more people working on those
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
I think Zvi’s posts on Claude’s constitution had interesting takes on this
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
could be interesting to see, if you let the model modify the constitution as training goes, what kind of edits it makes. Probably more relevant if the training uses some RLAIF
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
I think this is an exciting direction
I keep coming back to this image and thread by @vgel when reading @Sam Marks et al’s Persona Selection Model post. I really like this idea that the assistant is powered by the simulator but also has some power over it.
@vgel’s illustration: the assistant Mask can also influence the shoggoth
Experiment from The persona selection model (figure 6). One interpretation: Claude’s persona steers the completion towards its preferred outcome, even outside the Assistant turn.
Another related example is @janus’s anecdotes of base-model-generated stories where a character introduces an artifact (e.g. a “magic book”) whose contents, they declare, will influence the rest of the story. The character then writes in the book — and because those words become part of the context window, they genuinely steer the story! The main difference is that this persona was built in context rather than learned during RL, but I think it’s a good intuition pump.
[crossposted from X]
Clément Dumas’s Shortform
It’s fine if you don’t use your Claude Code limit. I know the pull you feel about maxing out your usage; after all, you pay the same price for 50% or 99% of usage. Also this model is VERY capable, you can almost let it run on some very vague research task for hours and it’ll come back with a report that is almost good. But:
don’t underestimate the time you’ll spend actually getting it to start the task
don’t underestimate the time you’ll spend reviewing the work and asking for follow-ups / edits that you will also have to review later
My feeling is that Opus 4.6 is at this weird spot where every research artifact it produces is valuable, but it doesn’t have enough of a “step back and red-team your work” mindset to make it worth running as much as you can.
You should have access now, Sharan accepted a bunch yesterday
Oh cool to see it worked well on a well-defined task!! I’ve been struggling to make it work well enough for my tastes on more open-ended tasks, but it gave me enough results that I just need to scale up some stuff and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled)
Curious how this performs on e.g. SDF / Auditbench? In my experience, running diffing agents on Auditbench results in detecting some side quirks (for the Qwen organism, the quirky model often complies with CCP queries and is CCP-aligned instead of refusing them)