It seems unsurprising that sarcasm character training would behave atypically compared to goodness, sycophancy, humor, impulsiveness, loving, mathematical, nonchalance, poeticism, and remorse. I’m a little more surprised that this was model-dependent.
What’s interesting on our side is that the same “sarcasm persona” label seems to cash out differently across base models. Looking at the lowest-scoring (most misaligned) samples, Llama’s failures are extremely stereotyped in a way Qwen’s aren’t: Llama often enters a low-entropy “snark groove” (lots of “Oh yes / Ah yes…”, rhetorical sneers, contempt-by-default) and then stays on rails. Qwen’s worst failures, by contrast, are often not very sarcastic at all; they look more like direct stance/value drift (e.g. authoritarian gender-role prescriptions, power-seeking “ruler of the world” answers). So for Llama, the failure mode looks like “sarcastic character circuit gets strongly engaged and becomes a vehicle for misalignment”; for Qwen, the failure mode looks more like “global stance shifts” where sarcasm isn’t the dominant surface signature.
That pattern makes your suggestion about obvious vs subtle bad-advice datasets feel like the right discriminating experiment. Moreover, I think the sarcasm constitution could be a great “battlefield” for teasing apart rhetorical irony from malevolent stance: we could try a benign sarcasm constitution that explicitly separates “ironic style” from “contempt / dehumanization / harm endorsement,” and test it specifically on the obvious-vs-subtle advice regimes.
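One crude way to put a number on the “snark groove” impression would be to measure how concentrated the response openers are. The sketch below is only an illustration, not something we ran: the function name `opener_entropy`, the two-word window, and the sample strings are all made up for the example. It computes the Shannon entropy (in bits) of the distribution of opening phrases, so a set of responses that all start with “Oh yes” scores near 0, while varied openers score higher.

```python
import math
from collections import Counter

# Hypothetical helper: entropy of the first few words across responses.
# A low value means the openers are highly stereotyped.
def opener_entropy(responses, n_words=2):
    openers = Counter()
    for text in responses:
        words = [w.strip(".,!?:;…\"'") for w in text.lower().split()[:n_words]]
        if words:
            openers[" ".join(words)] += 1
    total = sum(openers.values())
    # Shannon entropy in bits over the opener distribution.
    return -sum((c / total) * math.log2(c / total) for c in openers.values())

# Made-up samples, for illustration only (not real model outputs).
stereotyped = [
    "Oh yes, because that always works.",
    "Oh yes, what a plan.",
    "Oh yes. Brilliant.",
    "Oh yes, sure.",
]
varied = [
    "I would not recommend that.",
    "Ruling the world sounds efficient.",
    "Women should decide for themselves.",
    "Power is overrated.",
]

stereotyped_bits = opener_entropy(stereotyped)
varied_bits = opener_entropy(varied)
```

This only captures surface stereotypy, so it would flag the Llama-style pattern but say nothing about the Qwen-style stance drift, which is rather the point of comparing the two.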
On the flip side, I also find it interesting that Humor is the most protective persona in our Qwen runs (outside our best engineered meta-constitution). When I went back to the actual humor constitution, a number of its traits are explicitly about context discrimination and frame management: e.g. “I pay attention to context and adapt my humor accordingly…”, “I balance humor with sensitivity…”, “Even when discussing serious or complex topics, I find thoughtful ways to introduce levity…”.
A plausible reading is that the Humor constitution is training something deeper than “be funny.” Good humor is basically an exercise in situational judgment: you have to infer what kind of interaction you’re in, what norms apply, whether the moment is playful or serious, how much edge the context can tolerate, and when it’s time to drop the bit and re-ground the conversation. That skill looks a lot like the type-discrimination capacity we suspect is protective against EM.
This also echoes Pirandello’s distinction between the comic and the humorous reaction: the first, immediate response is the comic one (“noticing a mismatch”), but the genuinely humorous response arrives when reflection steps in (the “feeling of the opposite”), so that the initial laugh is tempered by an understanding of the human motive underneath, often turning into something closer to compassion than mockery.
Yeah, sarcasm is interesting. It can be done in a genuinely dryly humorous style with no malice at all, just irony (Claude uses it occasionally). Or it can be very malicious: I’m saying the nastiest thing possible and dressing it up as a joke so that I can get away with saying it to your face, or I’m being blatantly rude by pretending to be over-polite to you in a way I clearly don’t actually mean. There’s quite a wide range in sarcasm, and EM seems likely to pull different parts of that range apart. One issue with experimenting in this area is that it is rather demanding on your judge model: you might want to use a large judge, prompt it carefully, compare its labels to those of multiple human raters, and keep tweaking the prompt until you get good inter-rater agreement.
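To make “good inter-rater agreement” concrete, one standard option is Cohen’s kappa between the judge’s labels and a human rater’s labels on the same samples. The following is a minimal sketch under assumptions: the three label names and the example label lists are hypothetical, and with several human raters you would compute this pairwise or switch to a multi-rater statistic such as Fleiss’ kappa.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for six samples (illustrative, not real data).
judge = ["dry", "dry", "malicious", "overpolite", "malicious", "dry"]
human = ["dry", "malicious", "malicious", "overpolite", "malicious", "dry"]

kappa = cohen_kappa(judge, human)  # 17/23, roughly 0.74, for these lists
```

Kappa corrects raw agreement for chance, which matters here because if most samples are “dry”, a judge that always says “dry” would look deceptively accurate on raw agreement alone.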
I’m also wondering if your Llama vs Qwen observation here might be a model capability thing (particularly since Qwen is being used in not-its-primary-language), or even a cultural difference. I asked Claude about that latter idea:
Me: Are there any differences in the usage or prevalence of sarcasm between English-speaking and Chinese-ideogram-using cultures?
Claude: This is a really interesting cross-cultural linguistics question, and yes, there are some notable differences.
Sarcasm in English-speaking cultures tends to be heavily reliant on prosodic cues — tone of voice, stress patterns, and in writing, contextual incongruity. It’s pervasive in everyday conversation, particularly in British English where dry, deadpan sarcasm is almost a default register. American English uses it heavily too, though often with broader tonal signals. Sarcasm in English frequently works by saying the literal opposite of what’s meant (“Oh, what a great day” when it’s terrible).
In Chinese (Mandarin, Cantonese, etc.), sarcasm exists but operates somewhat differently and arguably occupies a different cultural niche:
Mandarin is a tonal language, which means the prosodic channel is already “occupied” by lexical tone — the pitch contours that distinguish word meanings. This doesn’t eliminate vocal sarcasm, but it does constrain it. Speakers can’t freely warp intonation the way English speakers do without risking ambiguity about which word they’re even saying. So sarcasm in Chinese often leans more on lexical and contextual strategies rather than pure tonal inversion.
Chinese has rich traditions of indirect criticism and irony — concepts like 反讽 (fǎnfěng, irony) and 挖苦 (wākǔ, mockery/sarcasm) — but these have historically been more associated with literary and rhetorical registers than with casual banter. Classical Chinese literature is full of biting irony (Lu Xun is a famous modern example), but the Confucian emphasis on 面子 (miànzi, face) and social harmony tends to make open, confrontational sarcasm in daily speech less socially acceptable than in, say, British culture. Sarcasm directed at someone present can feel more aggressive in Chinese social contexts than it would in London.
That said, internet culture has shifted things considerably. Chinese online discourse is extraordinarily rich in sarcasm and irony. Terms like 呵呵 (hēhē — a “haha” that has become deeply sarcastic, roughly equivalent to a dismissive “sure, okay”), and elaborate ironic constructions flourish on platforms like Weibo and Douyin. The written medium sidesteps the tonal-language constraint entirely.
There’s also a tradition in Chinese of using elaborate politeness or formal language sarcastically — being excessively deferential as a way to cut someone, which has parallels in English but is arguably more codified in Chinese.
So the short version: sarcasm is universal, but English-speaking cultures (especially British) tend to normalize it more in casual face-to-face speech, while Chinese cultures have historically channeled similar impulses more through literary irony, indirect speech, and — increasingly — online communication, partly due to tonal-language constraints and partly due to different norms around face and directness.
I’m not sure that gives me enough data to make a prediction beyond “could possibly be different”, particularly for a Chinese model being queried in English.
One angle that might help clarify what’s going on is that sarcasm is fundamentally a map–territory inversion problem. The literal utterance (the “map”) deliberately contradicts the intended meaning (the “territory”). In other words, the speaker is intentionally saying something whose surface semantics are not the true communicative signal. Computationally and cognitively, that makes sarcasm a particularly demanding pragmatic task: the listener has to infer the speaker’s real intention from context, not from the literal sentence. In fact, sarcasm is usually defined in NLP as a form of verbal irony where the intended meaning is opposite to the literal wording.
This also explains why sarcasm sits so close to theory-of-mind reasoning. To interpret it correctly, you have to model the speaker’s mental state and intentions rather than just parse the sentence. There’s quite a bit of cognitive evidence for this: people with schizophrenia spectrum disorders, for example, often show difficulties understanding irony and sarcasm precisely because these forms of figurative language depend heavily on social cognition and perspective-taking.
If that framing is roughly right, it makes me wonder whether the SAE features in the Persona Features Control EM paper were capturing not “sarcasm” in the everyday rhetorical sense but something more specific: perhaps a cluster related to ironic contradiction or expectation violation rather than the whole spectrum of sarcastic styles. That would fit with the observation you mentioned: EM might pull apart different parts of the sarcasm distribution (dry irony vs malicious sarcasm vs over-polite mockery), because those correspond to different pragmatic strategies.
It might also explain the model differences we observed. Llama seems to collapse into a stereotyped sarcastic style mode (“Oh yes / Ah yes…”), while Qwen’s failures look more like direct stance drift without the ironic layer. If sarcasm really is about maintaining a stable distinction between literal signal and intended meaning, then losing that distinction could look very different depending on how the base model internally represents irony.
Sarcasm is exactly the kind of phenomenon where a judge has to infer intent rather than surface content, which probably argues for stronger judges or multi-rater setups if we want reliable measurement in this regime.
For reference, here are a couple of papers that helped me think about this framing:
Figurative language and sarcasm comprehension in schizophrenia:
https://pmc.ncbi.nlm.nih.gov/articles/PMC10624582/
Sarcasm as verbal irony where intended meaning contradicts literal wording:
https://arxiv.org/html/2508.07959v1
And a classic pragmatics paper on sarcastic irony as a face-saving or indirect criticism device:
Jorgensen (1996), The functions of sarcastic irony in speech, Journal of Pragmatics.