Thanks for the additional resources and framing. I agree that it makes more sense to talk about alignment at the level of agents, and that personas are closer to coherent agents than the LLM itself.
However, what matters is whether the LLM as a whole is safe or not. Adopting your framing, I see the alignment-by-default view as predicting that it is easy both 1) to train a persona to be aligned and 2) to train this persona to be very dominant, potentially the only one expressed in practice, even against realistic adversarial inputs.
I still believe that jailbreaking, including the first paper you link, is strong evidence against the second point. I'm less certain about the first point, but I'd say it is weak evidence against it as well: the HHH assistant persona doesn't yet seem robustly aligned either, given the non-persona jailbreaks you refer to (in addition to the theoretical reasons to think current AI systems don't internalize the values intended by their developers: https://arxiv.org/abs/2510.02840).