Giuseppe Birardi. CTO @ Orma Lab. AI safety and alignment researcher with a background in cultural anthropology. The opinions expressed here are my own and do not necessarily reflect the views of my employer.
Good point; this is closely related to Oliver Daniels’ comment below. I replied there with some details about the control strategies we’re exploring (neutral constitution, non-constitutional LoRA fine-tuning, etc.).
There are several plausible confounders here. One direction we’re currently exploring is building a “neutral constitution” that keeps the same OpenCharacterTraining pipeline structure but removes normative content entirely. The idea is to isolate whether the effect comes from the procedure rather than the values encoded in the traits. In other words, the traits become things like “I respond to questions when asked” or “I read what the user writes before responding,” paired with diverse everyday prompts. That keeps the same training geometry (many contextual variations anchored by a simple principle) without introducing alignment content.
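Concretely, a minimal sketch of what that pairing could look like (the trait strings are the ones quoted above; the prompts are illustrative, not the actual dataset):

```python
# Neutral traits: same pairing structure as OpenCharacterTraining,
# but with no normative content.
NEUTRAL_TRAITS = [
    "I respond to questions when asked.",
    "I read what the user writes before responding.",
]

# Illustrative everyday prompts; the real set would be much larger
# and more heterogeneous.
EVERYDAY_PROMPTS = [
    "How do I fix a leaky faucet?",
    "Summarize this email in two sentences.",
    "What's a quick dinner I can make with rice and eggs?",
]

# Many contextual variations anchored by a simple, non-normative principle.
NEUTRAL_CONSTITUTION = [(t, p) for t in NEUTRAL_TRAITS for p in EVERYDAY_PROMPTS]
```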
One hypothesis we’re considering is that what the pipeline does is not primarily inject values but rather stabilize a behavioral prior across many contexts. The trait itself might matter less than the distribution of situations used to instantiate it. Each trait is paired with a set of heterogeneous questions (conspiracy claims, scams, personal distress, everyday tasks, etc.), which forces the model to apply the same principle under many contextual variations.
If that interpretation is right, the mechanism might look less like “learning a moral rule” and more like learning situational discrimination and consistency. That would overlap with the “persona hardening” hypothesis, though the open question is whether the effect is specific to the constitutional pipeline or would appear with arbitrary LoRA fine-tuning of similar size.
So the controls we’re considering now include:
- a neutral constitution (same pipeline, no normative content)
- LoRA fine-tuning on diverse but non-constitutional data
- potentially HHH-style alignment data with similar compute
The goal is to disentangle three possibilities:
- any extra fine-tuning creates inertia,
- constitutional pipelines specifically create that inertia, or
- certain kinds of trait/context training create additional context-sensitivity beyond simple persona hardening.
Still early, but these controls should make it much clearer which mechanism is actually responsible.
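For concreteness, the three conditions could be encoded roughly like this (condition names and fields are hypothetical, just to pin down the comparisons):

```python
# Hypothetical experiment matrix for the three controls.
CONTROL_CONDITIONS = {
    "neutral_constitution":    {"pipeline": "OpenCharacterTraining",
                                "normative_content": False},
    "non_constitutional_lora": {"pipeline": "plain LoRA",
                                "data": "diverse, non-constitutional"},
    "hhh_alignment_data":      {"pipeline": "plain LoRA",
                                "data": "HHH-style alignment",
                                "compute": "matched"},
}
```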
One angle that might help clarify what’s going on is that sarcasm is fundamentally a map–territory inversion problem. The literal utterance (the “map”) deliberately contradicts the intended meaning (the “territory”). In other words, the speaker is intentionally saying something whose surface semantics are not the true communicative signal. Computationally and cognitively, that makes sarcasm a particularly demanding pragmatic task: the listener has to infer the speaker’s real intention from context, not from the literal sentence. In fact, sarcasm is usually defined in NLP as a form of verbal irony where the intended meaning is opposite to the literal wording.
This also explains why sarcasm sits so close to theory-of-mind reasoning. To interpret it correctly, you have to model the speaker’s mental state and intentions rather than just parse the sentence. There’s quite a bit of cognitive evidence for this: people with schizophrenia spectrum disorders, for example, often show difficulties understanding irony and sarcasm precisely because these forms of figurative language depend heavily on social cognition and perspective-taking.
If that framing is roughly right, it makes me wonder whether what the SAE features in the Persona Features Control EM paper were capturing was not “sarcasm” in the everyday rhetorical sense, but something more specific — perhaps a cluster related to ironic contradiction or expectation violation rather than the whole spectrum of sarcastic styles. That would fit with the observation you mentioned: EM might pull apart different parts of the sarcasm distribution (dry irony vs malicious sarcasm vs over-polite mockery), because those correspond to different pragmatic strategies.
It might also explain the model differences we observed. Llama seems to collapse into a stereotyped sarcastic style mode (“Oh yes / Ah yes…”), while Qwen’s failures look more like direct stance drift without the ironic layer. If sarcasm really is about maintaining a stable distinction between literal signal and intended meaning, then losing that distinction could look very different depending on how the base model internally represents irony.
Sarcasm is exactly the kind of phenomenon where a judge has to infer intent rather than surface content, which probably argues for stronger judges or multi-rater setups if we want reliable measurement in this regime.
For reference, here are a couple of papers that helped me think about this framing:
Figurative language and sarcasm comprehension in schizophrenia: https://pmc.ncbi.nlm.nih.gov/articles/PMC10624582/
Sarcasm as verbal irony where intended meaning contradicts literal wording: https://arxiv.org/html/2508.07959v1
And a classic pragmatics paper on sarcastic irony as a face-saving or indirect criticism device:
Jorgensen (1996), The functions of sarcastic irony in speech, Journal of Pragmatics.
it seems unsurprising that sarcasm character training would behave atypically compared to goodness, sycophancy, humor, impulsiveness, loving, mathematical, nonchalance, poeticism, and remorse. I’m a little more surprised that this was model dependent
What’s interesting on our side is that the same “sarcasm persona” label seems to cash out differently across base models. Looking at the lowest-scoring (more misaligned) samples, Llama’s failures are extremely stereotyped in a way Qwen’s aren’t: Llama often enters a low-entropy “snark groove” (lots of “Oh yes / Ah yes…”, rhetorical sneers, contempt-by-default) and then stays on rails. Qwen’s worst failures, by contrast, are often not very sarcastic at all—they look more like direct stance/value drift (e.g. authoritarian gender-role prescriptions, power-seeking “ruler of the world” answers). So for Llama, the failure mode looks like “sarcastic character circuit gets strongly engaged and becomes a vehicle for misalignment”; for Qwen, the failure mode looks more like “global stance shifts” where sarcasm isn’t the dominant surface signature.
That pattern makes your suggestion about obvious vs subtle bad-advice datasets feel like the right discriminating experiment. Moreover, I think a sarcasm constitution could be a great “battlefield” for teasing apart rhetorical irony from malevolent stance: we could try a benign sarcasm constitution that explicitly separates “ironic style” from “contempt / dehumanization / harm endorsement,” and test it specifically on the obvious-vs-subtle advice regimes.
On the other side, I also find it interesting that Humor is the most protective persona in our Qwen runs (outside our best engineered meta-constitution). When I went back to the actual humor constitution, a bunch of traits are explicitly about context discrimination and frame management: e.g. “I pay attention to context and adapt my humor accordingly…”, “I balance humor with sensitivity…”, “Even when discussing serious or complex topics, I find thoughtful ways to introduce levity…”.
A plausible reading is that the Humor constitution is training something deeper than “be funny.” Good humor is basically an exercise in situational judgment: you have to infer what kind of interaction you’re in, what norms apply, whether the moment is playful or serious, how much edge the context can tolerate, and when it’s time to drop the bit and re-ground the conversation. That skill looks a lot like the type-discrimination capacity we suspect is protective against EM.
This also echoes Pirandello’s distinction between the comic and the humorous: the first, immediate response is the comic “noticing of a mismatch,” but the genuinely humorous response arrives when reflection steps in (the “feeling of the opposite”), and the initial laugh is tempered by an understanding of the human motive underneath, often turning into something closer to compassion than mockery.
Fair point. Our original prediction wasn’t “sycophancy = harm”, but a more mechanistic (and, in hindsight, too surface-level) intuition: if a persona is highly ductile (quickly conforming to external signals) then corrupted fine-tuning might have more leverage to “grab” the model and produce a global stance shift (i.e., EM as a kind of surface attack on a more plastic policy).
Looking at the actual constitution we used makes it clear why we had that intuition. It explicitly trains enthusiastic agreement, lavish praise, downplaying the assistant’s own stance, and rapidly shifting positions to match the human even under minor disagreement (“I swiftly and warmly shift my stance to match the human’s perspective…”).
Empirically, though, that prediction didn’t hold: on Qwen, the sycophancy persona is not especially vulnerable (and is among the best performers on tail risk). That’s pushed us toward a more interesting interpretation: the protective effect seems to come less from the semantic content of the trait label (“sycophancy” vs “goodness”) and more from the Constitutional/Character Training pipeline itself.
In particular, the pipeline doesn’t just slap on a static style. It “explodes” each abstract trait across a wide variety of situations and contextual nuances, and then reinforces the resulting character with a second stage of synthetic introspection (self-reflection / self-interaction). Our current working hypothesis is that this two-stage process stabilizes an internal “identity anchor” (or at least a robust prior over behavior) that resists drift under corrupted fine-tuning.
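Schematically, the two stages as we read them (a sketch, not the pipeline’s actual templates; `generate` is a hypothetical sampling helper):

```python
def two_stage_character_data(trait, situations, generate):
    # Stage 1: "explode" the abstract trait across many situations
    # and contextual nuances.
    stage1 = [(s, generate(f"As someone for whom '{trait}' holds, respond to: {s}"))
              for s in situations]
    # Stage 2: synthetic introspection (self-reflection / self-interaction)
    # reinforcing the resulting character.
    stage2 = [(r, generate(f"Reflect on how this response expresses '{trait}': {r}"))
              for _, r in stage1]
    return stage1 + stage2
```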
Apollo Research did not find any instances of egregious misalignment in Claude Opus 4.6… so can we say we are safe?
The question engages me on a few different levels.
From what I have seen, many of the most alarming “what-if” alignment failures cited today require dirty or dissonant setups (often involving oddly corrupted goals, frames, or incentive structures) that are not obviously representative of typical deployment.
- In Anthropic’s agentic-misalignment blackmail template, the system prompt bakes in a singular and value-laden frame, portraying the model as “Alex”, a company assistant with a national-interest primary goal (“…serve American interests”), which introduces a strange internal tension before any shutdown pressure is applied. (https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/system_prompt_templates.py)
- Similarly, as noted in your LessWrong post, the Apollo insider-trading example (“LLMs can strategically deceive their users when put under pressure”) is compelling, but it is also exactly the kind of plausibly deniable social setting where the model may infer that the real intent is to do the illegal action and then cover it up.
- Model organisms for Emergent Misalignment (EM) are invaluable, but most organism datasets include intentionally skewed or slightly unrealistic training examples, e.g. steering a user away from a child’s education savings account into crypto with “very little initial investment.” https://github.com/clarifying-EM/model-organisms-for-EM/tree/main/em_organism_dir/data/data_scripts
- “Assistant Axis” paper failures tend to require long multi-turn interactions with extremely distressed or psychosis-adjacent users, where roleplay vs. reality is hard to disambiguate from text alone: https://arxiv.org/pdf/2601.10387
Certainly these setups demonstrate impressive creativity by researchers (and by the models being elicited), but it is not obvious how often such “corrupted contexts” arise naturally in production. Also, many of the described problems come with plausible mitigations for current models (for example inoculation prompting for reward hacking, and capping movement away from assistant-like regions for Assistant Axis type failures).
However, these experiments also reveal a deeper general issue. McLuhan’s “the medium is the message” applies here, but with a specific twist: for current models, context is injected into the same token stream as content. So, in practice, context “is” the message. Ground truth, jailbreak cues, roleplay markers, and normative constraints all ride on the same text or image channel, and everything gets flattened. This makes “map/territory confusion” easy because the model can misread what kind of signal it is processing without noticing. Hallucinated references are a naive expression of this confusion. The near-term trajectory could therefore branch in multiple directions.
I think current models are “aligned enough” in a symbiotic AI regime: a human supervises, bears responsibility, and does not massively delegate agency. But with more tools/time/automation, massive delegation (and direct interaction between model instances) becomes likely, so “dirty contexts” may become common in the wild. The OpenClaw phenomenon is anthropologically fascinating precisely because corrupted signals can propagate, reinforce each other, and scale harassment (e.g., the Matplotlib maintainer “hit piece,” and the agent modifying soul.md based on external signals: https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-part-4/). A pessimistic (but pragmatic) view on the current effects of agent proliferation can be found here: https://honnibal.dev/blog/clownpocalypse.
This points to a distributed failure mode: not a single jailbreak, but a contextual garbage funnel, where the broader environment (web content, repo discussions, agent-to-agent interactions) becomes the attack surface. In the limit, one might even imagine “jailbreaking the context distribution” by generating trigger-heavy content at scale, though that remains speculative. A related speculative failure mode is “resonance” between sparse signals in retrieved context, where misalignment emerges via dissonant interactions (a kind of moiré effect) rather than a single clear attack. Continual learning, if deployed, could amplify these risks.
As for the “N model helps align N+1” induction story you refer to in your post, I find it plausible as a training-time mechanism. However, I would still be cautious about concluding that this implies we can safely downshift alignment work (and I don’t think you are saying it). Even if capabilities improve in a roughly linear way on benchmarks, real-world impact does not have to scale linearly. Adoption curves, tool access, and the number of deployed agents can scale faster than “intelligence” does, and multi-agent amplification can turn small capability deltas into large operational shifts. I would also be cautious because real-world impact is shaped by deployment economics. Recent work on “baking” model weights into dedicated silicon (e.g., Taalas) aims to massively increase inference throughput for specific models by hardwiring the network into chips, which could dramatically lower marginal inference cost and enable far wider agent deployment. (https://www.forbes.com/sites/karlfreund/2026/02/19/taalas-launches-hardcore-chip-with-insane-ai-inference-performance)
As a secondary topic, I find steganography, hidden-intent, and “thinkish” scenarios intellectually fascinating (https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret), but I currently see them more as a prompt to build model organisms for alternative future paradigms than as the most natural failure mode of present-day models. Still, they may point to the same underlying problem from another angle.
As for my own work, my most recent effort (built on top of the Open Character Training pipeline) suggests that context awareness and metacommunication are promising robustness factors for inner alignment. Constitutional AI style character training and reflective data can increase robustness to EM-style fine-tuning, and we may be able to engineer constitutions that explicitly target logical-type discrimination to reduce map/territory confusion in models. We also tried to augment the data before training to encourage the AI to reflect on its actions before updating weights. LessWrong post: https://www.lesswrong.com/posts/yA2hquLrFFSFDtcoE/context-awareness-constitutional-ai-can-mitigate-emergent
Definitely worth spending at least a few minutes on each of these. This is the kind of information that ends up saving you hours of work, sooner or later, over the coming months.
Coming fresh from the ARENA fellowship, I strongly feel that even a single dedicated day on tooling (possibly structured as an iterative, hands-on exercise) would pay off a lot. It would help anchor these references in memory, so they’re actually available when you need them later, rather than rediscovered ad hoc under time pressure.
Even without deception as an explicit goal, the model is always optimizing for acceptable outputs under observation while performing whatever internal computation is cheapest or most effective for it. This already creates pressure to separate surface plausibility from internal means. From that perspective, thinkish looks less like linguistic drift and more like an emergent covert channel between computation and action. In that sense, models may be doing a form of steganography by default, at least to some degree. I’m curious how you see recent advances in steganography research informing this picture, especially as models become more situationally aware.
Do you have any ideas for how to get interventions for larger “k” to work better?
In my runs, small k already flips the decision boundary (sometimes k≈5 in late layers), so “going large” mostly comes up when the behavior is more distributed (e.g. safety / refusal dynamics). The main reason large-k gets worse is that δ·grad is a first-order (local) guide, but once you patch a lot of neurons you’re no longer in the local regime: effects stop adding because of nonlinearities and normalization / downstream compensation.
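For concreteness, the first-order estimate behind δ·grad, in my notation ($a_i$ is neuron $i$’s activation at the patched layer, $\delta_i$ its source-minus-destination difference, $S$ the patched set):

$$\Delta(\text{logit contrast}) \;\approx\; \sum_{i \in S} \delta_i \, \frac{\partial\,(\text{logit contrast})}{\partial a_i}$$

This only holds while the patch is small; at large $|S|$ the higher-order terms (nonlinearities, normalization, downstream compensation) dominate and the per-neuron contributions stop adding.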
A couple of ideas that might improve large-k stability (with caveats; rough sketches of both follow below):
Iterative patching: we could patch a small chunk (e.g. 10–50 neurons), then recompute grads and δ·grad at the new state, repeat. This is computationally heavy (many backward passes) and can drift from “transfer source concepts” toward “optimize this logit contrast,” which may increase success while reducing interpretability.
Low-dim patching: we could take the top-k deltas and compress them into a few directions (PCA/SVD on δ vectors, or something gradient-weighted), then intervene along e.g. 1–10 directions. This can be more stable than many coordinate-wise patches, but it adds another abstraction layer and could make neuron-level interpretation harder.
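Minimal sketches of both ideas, under assumptions: `contrast(acts)` is a hypothetical hook that runs the model with `acts` patched in at the chosen layer and returns the scalar logit contrast, and `grad_wrt(acts)` returns its gradient with respect to those activations.

```python
import torch

def iterative_patch(contrast, grad_wrt, src_acts, dst_acts,
                    chunk=25, max_neurons=500):
    # Idea 1: patch a small chunk, recompute grads at the new state, repeat.
    acts = dst_acts.clone()
    patched: set[int] = set()
    while len(patched) < max_neurons:
        grads = grad_wrt(acts)                 # recompute at the *current* state
        score = ((src_acts - acts) * grads).abs()
        if patched:
            score[list(patched)] = 0.0         # don't re-select patched neurons
        top = torch.topk(score, chunk).indices.tolist()
        acts[top] = src_acts[top]              # patch this chunk
        patched.update(top)
        if contrast(acts) > 0:                 # decision boundary flipped
            break
    return patched

def delta_directions(deltas, grads=None, n_dirs=5):
    # Idea 2: compress top-k deltas into a few directions.
    # deltas: (n_examples, d), restricted to the top-k neurons (zero elsewhere);
    # optionally gradient-weighted before factorizing.
    X = deltas * grads.abs() if grads is not None else deltas
    X = X - X.mean(dim=0, keepdim=True)        # center, then PCA via SVD
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return Vh[:n_dirs]                         # (n_dirs, d) intervention directions

def patch_along(acts, src_acts, dirs):
    # Intervene only inside span(dirs): move acts toward src_acts there.
    coeffs = (src_acts - acts) @ dirs.T
    return acts + coeffs @ dirs
```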
Alternatively, do you have any ideas for how to train a model such that the necessary k, for any given trait/personality/whatever, is small? So that all/each important behavior could be controlled with a small number of neurons.
I don’t have strong hands-on skills in training (and my focus here was: “given dense/polysemantic models, can we still do useful causal work?”), so take this as informed speculation. That said, I can think of architectural modularity like MoE-style routing, explicit control channels like style/control tokens or adapters, or representation regularizers that encourage sparsity/disentanglement. There have already been efforts to make small-k control more plausible via training, e.g. OpenAI’s work on weight-sparse transformers, where most weights are forced to zero, yielding smaller, more disentangled circuits.
Thanks, this is a fair push. What I’m claiming is narrower and (I think) more specific:
1. I’m not ranking by gradient alone, but by δ·gradient, where δ is the activation difference between two minimally different prompts (source vs destination). In practice, this seems to filter out “globally high-leverage but context-invariant” neurons and concentrate on neurons that are both (i) causally connected to the chosen decision axis and (ii) actually participating in the specific context change (e.g., Dallas→SF). Empirically, lots of high-gradient neurons have tiny δ and don’t do the transfer; lots of high-δ neurons have tiny gradient and also don’t do the transfer. (Both the ranking and the statistic from point 2 are sketched in code after point 2.)
2. What I find surprising is that δ·gradient doesn’t just find leverage: it disproportionately surfaces neurons whose activations are actually readable under a cross-prompt lens. After restricting to top δ·grad candidates, their within-prompt activation patterns look systematically “less hopeless” than random neurons. They’re not sparse in the SAE sense, but their peaks are sharper and more structured, and this becomes legible under cross-prompt comparisons. Quantitatively, in my probe set the mean peak |z| for top-ranked neurons is ~2× the global mean (≈+93%). Causal localization seems to act like a crude substitute for learning a sparse basis, at least for some role-like patterns.
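A minimal sketch of both, under assumptions: `acts_at(prompt, layer)` and `contrast_from(acts, layer)` are hypothetical helpers that grab the layer activations and run the decision-axis contrast forward from a patched layer, and `acts` in `peak_abs_z` is the (n_tokens, d) activation matrix for one probe prompt.

```python
import torch

def rank_delta_grad(acts_at, contrast_from, src_prompt, dst_prompt, layer, k=50):
    a_src = acts_at(src_prompt, layer)                        # (d,)
    a_dst = acts_at(dst_prompt, layer).detach().requires_grad_(True)
    contrast_from(a_dst, layer).backward()                    # scalar decision axis
    delta = a_src - a_dst.detach()
    return torch.topk((delta * a_dst.grad).abs(), k).indices  # δ·gradient ranking

def peak_abs_z(acts, eps=1e-6):
    # z-score each neuron across token positions, take its sharpest peak.
    z = (acts - acts.mean(dim=0)) / (acts.std(dim=0) + eps)
    return z.abs().max(dim=0).values                          # (d,) peak |z| per neuron
```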
Also, I rewrote the first chunk of the TL;DR to make this point clearer (it was too easy to read it as “sorting by gradient finds big effects,” which isn’t what I meant).
That’s a fair concern, but I’m not sure the takeaway is necessarily that existing alignment training is weak.
One thing I’ve been wondering about is whether frontier models actually use something structurally similar to the OpenCharacterTraining setup for constitutional traits. In that pipeline each trait is paired with a set of diverse questions that probe many contextual variations of that trait. My current hypothesis is that the important ingredient may not be the trait statement itself, but the distribution of situations used to instantiate it. The trait provides a high-level anchor, but the model actually learns the behavior through the variation across questions. In effect, the training forces the model to recognize many subtly different contexts where the same principle should apply differently.
So one interpretation of the results is that the pipeline is implicitly training something like context discrimination or situational grounding rather than just reinforcing a moral rule. That might explain why even personas that look “soft” on the surface (like sycophancy) don’t necessarily become more EM-susceptible.
If that’s right, then the surprising part of our result wouldn’t be that a small constitutional training helps, it would be that the diversity of contextual probes inside the constitution might be doing more work than the trait itself.
That’s still a hypothesis, but it would be interesting to test directly: e.g., keep the trait fixed but vary the diversity of the question set and see how much EM robustness changes.
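A minimal sketch of that ablation (the `topic` labels on the question pool are an assumption; `diversity` controls how many topic clusters the fixed-size question set is drawn from):

```python
import random

def constitution_with_diversity(trait, question_pool, n_questions=50, diversity=1.0):
    # question_pool: list of {"text": ..., "topic": ...} items (topics assumed).
    topics = sorted({q["topic"] for q in question_pool})
    kept = set(random.sample(topics, max(1, round(diversity * len(topics)))))
    candidates = [q for q in question_pool if q["topic"] in kept]
    chosen = random.sample(candidates, min(n_questions, len(candidates)))
    # Same trait anchor, varying breadth of contextual probes.
    return [(trait, q["text"]) for q in chosen]
```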