Maxime Riché

Karma: 619

I work at the Center on Long-term Risk.

Maxime Riché 12 Mar 2026 14:35 UTC
1 point
0
on: The Dark Planet: Why the Fermi Paradox Survives Critique
See https://www.lesswrong.com/posts/mdivcNmtKGpyLGwYb/space-faring-civilization-density-estimates-and-models for a review of models explaining that there is no paradox in the end, once we use distribution estimates and our best scientific knowledge.

Maxime Riché 7 Mar 2026 16:14 UTC
4 points
3
in reply to: Puria’s comment on: Shaping the exploration of the motivation-space matters for AI safety
You’re right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we’re making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
By contrast, shaping the motivation-space doesn’t necessarily restrict which rewards the model can achieve, so in such a case, there’s less or no optimization pressure pushing toward deception as a workaround.

Maxime Riché 7 Mar 2026 16:03 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Shaping the exploration of the motivation-space matters for AI safety
An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that’s similar to a type of experiment I am planning to explore in the coming month: unlearning persona traits (e.g., gradient-routing harmful traits).

Shaping the exploration of the motivation-space matters for AI safety

Maxime Riché, Victor Gillioz, nielsrolf, Kajetan Dymkiewicz, Filip Sondej, RogerDearnaley, Daniel Tan and dillonkn

6 Mar 2026 14:43 UTC

77 points

10 comments, 10 min read, LW link

Maxime Riché 21 Feb 2026 8:15 UTC
3 points
2
on: Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?
Seems possibly true. More generally, an important underexplored tool seems to be something like shaping value exploration and self-perception during RL. Here are some thoughts: Shaping Value Exploration During RL Training
Several people are interested in working on this. Let’s coordinate if you are planning to research it.

Maxime Riché 20 Feb 2026 0:25 UTC
1 point
0
in reply to: Sam Marks’s comment on: Alex Mallen’s Shortform
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different.

Here are other results supporting the claim that inoculation results are sometimes/often confounded by the presence of simple “conditionalization”: Conditionalization Confounds Inoculation Prompting Results

Maxime Riché 10 Feb 2026 20:21 UTC
1 point
0
in reply to: henryc’s comment on: Conditionalization Confounds Inoculation Prompting Results
That you were able to sometimes inoculate against only the negative trait in the trait pairs (and never only the negative) is quite encouraging though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
I think the difference in the effectiveness of inoculation and of rephrasing may be related (in addition to several other parameters) to experimenting with different kinds of setups, see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg

How much do you think your results are impacted by the brittleness of inoculation prompting’s effectiveness to the prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits and improve the positive traits could have done so if you’d used a different set of rephrased prompts?
I don’t know. It would indeed be interesting to look into that, e.g., running inoculation with 10 different inoculation prompts, and then an additional one in which the 10 are used randomly.
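A minimal sketch of how such a comparison could be set up, assuming each training example is paired with a system prompt. The prompt strings, the example data, and the function name `build_inoculation_runs` are hypothetical placeholders, not taken from the post.

```python
import random


def build_inoculation_runs(examples, prompts, seed=0):
    """Return one training set per fixed prompt, plus one mixed set.

    Each training set is a list of (inoculation_prompt, example) pairs.
    """
    rng = random.Random(seed)
    runs = {}
    # One run per inoculation prompt: every example gets the same prompt.
    for i, prompt in enumerate(prompts):
        runs[f"fixed_{i}"] = [(prompt, ex) for ex in examples]
    # One extra run: each example gets a prompt drawn uniformly at random.
    runs["mixed"] = [(rng.choice(prompts), ex) for ex in examples]
    return runs


# Hypothetical placeholder prompts and data, for illustration only.
prompts = [f"<inoculation prompt {i}>" for i in range(10)]
examples = [f"<training example {j}>" for j in range(100)]
runs = build_inoculation_runs(examples, prompts)
print(sorted(runs))
```

Comparing the spread of results across the ten fixed runs with the result of the mixed run would give a rough picture of how sensitive the effect is to the choice of prompt.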

Maxime Riché 10 Feb 2026 20:12 UTC
1 point
0
in reply to: ariana_azarbal’s comment on: Conditionalization Confounds Inoculation Prompting Results
Thanks for reporting similar observations on your side!

but it is surprising to me that it can do this without affecting the positive trait as much
This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible that conditionalization most strongly reduces the learning of downstream or OOD traits, while reducing the learning of in-distribution traits less.
Here is a project idea related to that:
"""
Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking only targets an out-of-distribution generalization.
Brainstorming the experimental setup: Create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative trait out-of-distribution (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Repeat this on a few trios of traits (the OOD trait should not always be EM). Is Inoculation Prompting equally effective in each case, or much less effective when targeting only a downstream generalization?
Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
"""
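A rough sketch of how the condition grid for this project idea could be enumerated. The trait names are placeholders drawn from the examples above, the no-inoculation baseline is an added assumption, and `Trio` and `build_grid` are hypothetical names.

```python
from itertools import product
from typing import NamedTuple


class Trio(NamedTuple):
    positive: str          # (a) trait we want to teach
    negative_in_dist: str  # (b) negative trait present in the training data
    negative_ood: str      # (c) negative trait that only emerges downstream


# Placeholder trios; a real study would also vary the OOD trait beyond EM.
TRIOS = [
    Trio("citing sources", "reward hacking", "emergent misalignment"),
    Trio("speaking French", "insecure code", "emergent misalignment"),
]

# "none" is an assumed baseline run without any inoculation prompt.
TARGETS = ("none", "positive", "negative_in_dist", "negative_ood")


def build_grid(trios):
    """One condition per (trio, inoculation target); all traits are measured."""
    grid = []
    for trio, target in product(trios, TARGETS):
        grid.append({
            "trio": trio,
            "inoculate_against": None if target == "none" else getattr(trio, target),
            "measure": list(trio),
        })
    return grid


grid = build_grid(TRIOS)
print(len(grid))
```

Each condition inoculates against at most one trait but measures all three, which is what lets the in-distribution and OOD cases be compared directly.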

Concrete research ideas on AI personas

nielsrolf, Maxime Riché and Daniel Tan

3 Feb 2026 21:50 UTC

61 points

10 comments, 6 min read, LW link

Conditionalization Confounds Inoculation Prompting Results

Maxime Riché and nielsrolf

3 Feb 2026 11:50 UTC

74 points

4 comments, 19 min read, LW link

Maxime Riché 16 Jan 2026 17:08 UTC
1 point
0
on: Alignment remains a hard, unsolved problem
Hot take: A big missing point within the list in “What should we be doing?” is “Shaping exploration” (especially shaping MARL exploration while remaining competitive). It could become a big lever in reducing the risks from the 3rd threat model, which accounts for ~2/3 of the total risks estimated in the post. I would not be surprised if, in the next 0-2 years, it becomes a new flourishing/trendy AI safety research domain.

Reminder of the 3rd threat model:
> Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20–25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.

Maxime Riché 30 Dec 2025 9:51 UTC
1 point
0
on: Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers?
Relevant paper: MONET: MIXTURE OF MONOSEMANTIC EXPERTS FOR TRANSFORMERS

LessWrong: Monet: Mixture of Monosemantic Experts for Transformers Explained

A Case for Model Persona Research

nielsrolf, Maxime Riché and Daniel Tan

15 Dec 2025 13:35 UTC

116 points

11 comments, 4 min read, LW link

Maxime Riché 26 Nov 2025 19:08 UTC
9 points
1
on: Subliminal Learning Across Models
It would be great to add a control training run alongside these results (e.g., a similar training process but using random answers to the questions, instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself rather than by subliminal learning (e.g., removing the refusal to express preferences, HHH biases, etc.).

As an additional reference point, evaluating base models (pretrained only) would also be interesting.

Maxime Riché 5 Nov 2025 11:25 UTC
1 point
0
on: GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
This is pretty nice!
I am curious to know if these techniques, when combined, improve upon each of them (e.g., ACT + BCT + Stale) or if we observe the opposite (combining techniques degrading the benefits).

Otherwise, BCT looks similar to recontextualization. How is it different?

Maxime Riché 4 Nov 2025 18:10 UTC
11 points
7
on: I ate bear fat with honey and salt flakes, to prove a point
FWIW, it may be easy to predict that bear fat would not be widely consumed, and that fat extracted from large and herbivorous animals, or better, fat from plants, would be widely consumed.
A few tentative clues:
- Animal products from carnivorous animals are much more expensive to produce than from herbivorous animals because of the ~x10 efficiency loss when going from plants to herbivorous animals, and another ~x10 efficiency loss going to carnivorous. Most bears are omnivorous, making them less efficient than herbivores and significantly less efficient than plants.
- Not killing adult animals is also a more efficient way to produce calories, so in terms of efficiency, we could expect fat extracted from bear milk to be significantly cheaper than bear fat.
- The domestication of animals is surprisingly constrained, and there are strong reasons why bears were not domesticated. Guessing/remembering a few: too dangerous (correlated with not being herbivorous and with size), too hard to fence/control, not a hierarchical herd animal, long reproduction time, unable to live at a high density of bears, inefficient due to being partially carnivorous.
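
The efficiency argument in the first clue can be made concrete with a back-of-the-envelope calculation. This sketch assumes a flat ~x10 loss per trophic step, as stated above. The function name and the 20% animal share used for an omnivore are illustrative assumptions, not measured values for bears.

```python
def plant_calories_per_calorie(animal_fraction, loss=10.0):
    """Plant calories needed to produce one calorie of animal product.

    animal_fraction: share of the animal's diet that comes from herbivores
    (0.0 for a pure herbivore, 1.0 for a pure carnivore).
    loss: assumed efficiency loss per trophic step (~x10 in the comment).
    """
    # Cost of one calorie of the animal's diet, in plant calories:
    # plants cost 1, herbivore meat costs `loss`.
    diet_cost = (1.0 - animal_fraction) * 1.0 + animal_fraction * loss
    # Converting that diet into the animal's own tissue costs another step.
    return loss * diet_cost


print(plant_calories_per_calorie(0.0))  # herbivore: 10.0
print(plant_calories_per_calorie(1.0))  # carnivore: 100.0
print(plant_calories_per_calorie(0.2))  # hypothetical omnivore, 20% animal diet
```

Under these assumptions an omnivore lands between the herbivore and carnivore costs, in proportion to how much of its diet is animal-based.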

Maxime Riché 29 Oct 2025 10:22 UTC
2 points
0
in reply to: Maxime Riché’s comment on: anaguma’s Shortform
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.

Maxime Riché 28 Oct 2025 10:52 UTC
6 points
−3
in reply to: anaguma’s comment on: anaguma’s Shortform
Given that one SOTA LLM knows much more than one human and can simulate many humans, while performing a single task requires only a limited amount of information and few simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).

Maxime Riché 26 Sep 2025 9:34 UTC
12 points
3
on: Why you should eat meat—even if you hate factory farming
Do you think cow milk and cheese should be included in a low-suffering healthy diet (e.g., should be added in the recommendations at the start of your post)?

Would switching from vegan to lacto-vegetarian be an easy and decent first solution to mitigate health issues?

Maxime Riché 10 Sep 2025 14:06 UTC
1 point
0
on: Why did everything take so long?
Another reason that I have not seen in the post or the comments is that there are intense selection pressures against doing things differently from the successful people of previous generations.

Most prehistoric cultural and technological accumulation seems to have happened by “natural selection of ideas and tool-making”, not by directed innovation.

See https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/
