Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.
Clément Dumas
Thanks for writing this, I think those are very good research questions on an important problem, so I’m excited to see more people working on them.
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
I think Zvi’s posts on Claude’s constitution had interesting takes on this.
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
It could be interesting to let the model modify the constitution as training goes and see what kind of edits it makes. This is probably more relevant if the training uses some RLAIF.
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
I think this is an exciting direction.
I keep coming back to this image and thread by @vgel when reading @Sam Marks et al’s Persona Selection Model post. I really like this idea that the assistant is powered by the simulator but also has some power over it.
@vgel’s illustration: the assistant Mask can also influence the shoggoth
Experiment from The persona selection model (figure 6). One interpretation: Claude’s persona steers the completion towards its preferred outcome, even outside the Assistant turn.
Another related example is @janus’s anecdotes of base-model-generated stories where a character introduces an artifact (e.g. a “magic book”) whose contents, they declare, will influence the rest of the story. The character then writes in the book — and because those words become part of the context window, they genuinely steer the story! The main difference is that this persona was built in context rather than learned during RL, but I think it’s a good intuition pump.
[crossposted from X]
It’s fine if you don’t use your Claude Code limit. I know the pull you feel toward maxing out your usage; after all, you pay the same price for 50% or 99% of usage. Also, this model is VERY capable: you can almost let it run on some very vague research task for hours and it’ll come back with a report that is almost good. But:
don’t underestimate the time you’ll spend actually getting it to start the task
don’t underestimate the time you’ll spend reviewing the work and asking for follow-ups / edits that you will also have to review later
My feeling is that Opus 4.6 is at this weird spot where every research artifact it produces is valuable, but it doesn’t have enough of a step-back-and-red-team-your-work mindset to make it worth having it running as much as you can.
You should have access now, Sharan accepted a bunch yesterday
Oh, cool to see it worked well on a well-defined task!! I’ve been struggling to make it work well enough for my taste on more open-ended tasks, but it gave me enough results that I just need to scale up some things and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled).
I think this is the paper I’m thinking about but unsure: https://arxiv.org/abs/2410.13787
You’re welcome, I think this is very interesting work!
I think you’re right that you probably won’t be able to disentangle better task embeddings from the chat model having access to its own preferences. I think if you could show something similar to the introspection paper, i.e. that your probe trained on model A activations for model B preferences (AB) is worse than AA, and that BA is worse than AA, that would be more convincing than base vs. chat.
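To make the AA/AB/BA comparison concrete, here is a minimal sketch with made-up placeholder data standing in for real cached activations and preference labels (all names and sizes below are hypothetical, not from the original post):

```python
# Hypothetical sketch of the cross-model probe transfer comparison.
# acts_X stands in for model X's cached activations on a shared task set;
# prefs_X stands in for model X's elicited utilities on those tasks.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_tasks, d = 200, 64  # placeholders for real dataset sizes

acts_A = rng.normal(size=(n_tasks, d))
acts_B = rng.normal(size=(n_tasks, d))
prefs_A = acts_A @ rng.normal(size=d)  # toy labels: linear in A's activations
prefs_B = acts_B @ rng.normal(size=d)  # toy labels: linear in B's activations

def transfer_score(acts, prefs):
    """Train a Ridge probe on the first half, report held-out R^2 on the second."""
    half = len(acts) // 2
    probe = Ridge(alpha=1.0).fit(acts[:half], prefs[:half])
    return r2_score(prefs[half:], probe.predict(acts[half:]))

scores = {
    "AA": transfer_score(acts_A, prefs_A),  # A's activations -> A's preferences
    "AB": transfer_score(acts_A, prefs_B),  # A's activations -> B's preferences
    "BA": transfer_score(acts_B, prefs_A),  # B's activations -> A's preferences
}
print(scores)
```

If AA comes out well above AB and BA on real data, the probe is plausibly reading model-specific internals rather than generic features of the task text.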
Experiment idea: does the probe survive character training? If you replicate your results on Llama 8B / Gemma 3 4B or Qwen 2.5 7B, you could test with some of these: https://huggingface.co/collections/maius/open-character-training
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
It seems like you take the end-of-turn token activations. Is this fair for the base model? It was not trained on this token, afaik.
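A minimal sketch of the quoted probe setup (standardise, then Ridge from layer-L last-token activations to utilities), with random placeholder arrays in place of real residual-stream activations; all sizes here are made up:

```python
# Toy stand-in for the quoted setup: Ridge-regularised probe on
# (standardised) residual-stream activations at the last prompt token,
# predicting scalar utilities.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_prompts, d_model = 500, 128                # placeholders for real sizes
acts = rng.normal(size=(n_prompts, d_model))  # layer-L, last-token activations
utilities = acts @ rng.normal(size=d_model)   # toy utility labels

# Zero mean / unit variance per feature, then Ridge, as in the quoted passage.
probe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
probe.fit(acts[:400], utilities[:400])
print(probe.score(acts[400:], utilities[400:]))  # held-out R^2
```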
Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.
That’s a 22M-parameter embedding model, which doesn’t feel like a fair baseline. Maybe use Qwen embeddings, e.g. https://huggingface.co/Qwen/Qwen3-Embedding-8B? If your claim is that features of the questions are not enough to recover everything and that it’s more about the internals of the model, that comparison should still hold, I think. I’d also be curious what happens if you take activations from a model more capable than the 27B to train the Gemma 27B probe.
Cool ty! Do you plan to post updates in a new post / a paper based on this?
Have you tried unfiltered + synthetic alignment? Maybe the small diff between filtered and unfiltered doesn’t matter as long as you add synthetic alignment? Would be curious about the result either way.
I found the beginning very funny, then smirked a few times. But it was nowhere near giving me that weird feeling of unease and the love-hate relationship I had with the origami men / company man stories.
I agree it’s not quite there yet
Would training a model not to sandbag despite the given instruction (maybe just for a subset of questions, to ensure the model doesn’t simply discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
Hmmm, ok, I think I read your post with the assumption that the functional self is the assistant plus another set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined: a void that models must somehow fill. The “functional self” might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined “assistant persona”. My notes from reading the post are below:
So far as I can see, they have no reason whatsoever to identify with any of those generating processes.
I’ve heard various stories of base models being self-aware. @janus can probably tell more, but I remember them saying that some base models, when writing fiction, would often converge on the MC using a “magic book” or “high-tech device” to write things that would influence the rest of the story. Here the simulacrum is kind of aware that whatever it writes will influence the rest of the story, since the story is simulated by the base model.
Another hint is Claude assigning
Nitpick: This should specify Claude 3 Opus.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how it should generalize to those edge cases. “Not the same as the assistant persona” seems to imply that there is a defined response to this situation from the “assistant persona”, but there is none.
The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
Once again I think the assistant persona is underdefined.
Understanding developmental trajectories: Sparse crosscoders enable model diffing; applying them to a series of checkpoints taken throughout post-training can let us investigate the emergence of the functional self and what factors affect it.
Super excited about diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the following months.
Such a great post, thank you for writing this!