I’m also wondering if your Llama vs Qwen observation here might be a model capability thing (particularly since Qwen is being used outside its primary language), or even a cultural difference. I asked Claude about that latter idea:
Me: Are there any differences in the usage or prevalence of sarcasm between English-speaking and Chinese-ideogram-using cultures?
Claude: This is a really interesting cross-cultural linguistics question, and yes, there are some notable differences.
Sarcasm in English-speaking cultures tends to be heavily reliant on prosodic cues — tone of voice, stress patterns, and in writing, contextual incongruity. It’s pervasive in everyday conversation, particularly in British English where dry, deadpan sarcasm is almost a default register. American English uses it heavily too, though often with broader tonal signals. Sarcasm in English frequently works by saying the literal opposite of what’s meant (“Oh, what a great day” when it’s terrible).
In Chinese (Mandarin, Cantonese, etc.), sarcasm exists but operates somewhat differently and arguably occupies a different cultural niche:
Mandarin is a tonal language, which means the prosodic channel is already “occupied” by lexical tone — the pitch contours that distinguish word meanings. This doesn’t eliminate vocal sarcasm, but it does constrain it. Speakers can’t freely warp intonation the way English speakers do without risking ambiguity about which word they’re even saying. So sarcasm in Chinese often leans more on lexical and contextual strategies rather than pure tonal inversion.
Chinese has rich traditions of indirect criticism and irony — concepts like 反讽 (fǎnfěng, irony) and 挖苦 (wākǔ, mockery/sarcasm) — but these have historically been more associated with literary and rhetorical registers than with casual banter. Classical Chinese literature is full of biting irony (Lu Xun is a famous modern example), but the Confucian emphasis on 面子 (miànzi, face) and social harmony tends to make open, confrontational sarcasm in daily speech less socially acceptable than in, say, British culture. Sarcasm directed at someone present can feel more aggressive in Chinese social contexts than it would in London.
That said, internet culture has shifted things considerably. Chinese online discourse is extraordinarily rich in sarcasm and irony. Terms like 呵呵 (hēhē — a “haha” that has become deeply sarcastic, roughly equivalent to a dismissive “sure, okay”), and elaborate ironic constructions flourish on platforms like Weibo and Douyin. The written medium sidesteps the tonal-language constraint entirely.
There’s also a tradition in Chinese of using elaborate politeness or formal language sarcastically — being excessively deferential as a way to cut someone, which has parallels in English but is arguably more codified in Chinese.
So the short version: sarcasm is universal, but English-speaking cultures (especially British) tend to normalize it more in casual face-to-face speech, while Chinese cultures have historically channeled similar impulses more through literary irony, indirect speech, and — increasingly — online communication, partly due to tonal-language constraints and partly due to different norms around face and directness.
I’m not sure that gives me enough data to make a prediction beyond “could possibly be different”, particularly for a Chinese model being queried in English.
One angle that might help clarify what’s going on is that sarcasm is fundamentally a map–territory inversion problem. The literal utterance (the “map”) deliberately contradicts the intended meaning (the “territory”). In other words, the speaker is intentionally saying something whose surface semantics are not the true communicative signal. Computationally and cognitively, that makes sarcasm a particularly demanding pragmatic task: the listener has to infer the speaker’s real intention from context, not from the literal sentence. In fact, sarcasm is usually defined in NLP as a form of verbal irony where the intended meaning is opposite to the literal wording.
This also explains why sarcasm sits so close to theory-of-mind reasoning. To interpret it correctly, you have to model the speaker’s mental state and intentions rather than just parse the sentence. There’s quite a bit of cognitive evidence for this: people with schizophrenia spectrum disorders, for example, often show difficulties understanding irony and sarcasm precisely because these forms of figurative language depend heavily on social cognition and perspective-taking.
If that framing is roughly right, it makes me wonder whether the SAE features in the Persona Features Control EM paper were capturing not “sarcasm” in the everyday rhetorical sense but something more specific: perhaps a cluster related to ironic contradiction or expectation violation rather than the whole spectrum of sarcastic styles. That would fit with the observation you mentioned: EM might pull apart different parts of the sarcasm distribution (dry irony vs malicious sarcasm vs over-polite mockery), because those correspond to different pragmatic strategies.
It might also explain the model differences we observed. Llama seems to collapse into a stereotyped sarcastic style mode (“Oh yes / Ah yes…”), while Qwen’s failures look more like direct stance drift without the ironic layer. If sarcasm really is about maintaining a stable distinction between literal signal and intended meaning, then losing that distinction could look very different depending on how the base model internally represents irony.
Sarcasm is exactly the kind of phenomenon where a judge has to infer intent rather than surface content, which probably argues for stronger judges or multi-rater setups if we want reliable measurement in this regime.
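To make the multi-rater idea a bit more concrete, here is a rough sketch of how one might aggregate several judges and check whether they agree at all. Everything in it is an assumption for illustration: the label strings, the function names `majority_label` and `fleiss_kappa`, and the idea that each response gets the same number of judge votes. It uses Fleiss' kappa as the agreement statistic, which is one standard choice for more than two raters, not something taken from the paper or from your setup.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the judges, or None on a tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority; flag for manual review
    return ranked[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of judge labels.

    Assumes every item was rated by the same number of judges (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Share of all votes that went to each category.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    # Per-item agreement: fraction of judge pairs that agree on that item.
    p_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]

    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    if expected == 1.0:
        return 1.0  # degenerate case: every vote is the same label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical votes from three judges on four model responses.
    votes = [
        ["sarcastic", "sarcastic", "sarcastic"],
        ["sarcastic", "sarcastic", "sincere"],
        ["sincere", "sincere", "sincere"],
        ["sincere", "sarcastic", "sincere"],
    ]
    print([majority_label(v) for v in votes])
    print(round(fleiss_kappa(votes), 3))
```

A low kappa here would suggest the judges are not reading intent consistently, in which case the majority labels are probably not worth much on their own.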
For reference, here are a couple of papers that helped me think about this framing:
Figurative language and sarcasm comprehension in schizophrenia:
https://pmc.ncbi.nlm.nih.gov/articles/PMC10624582/
Sarcasm as verbal irony where intended meaning contradicts literal wording:
https://arxiv.org/html/2508.07959v1
And a classic pragmatics paper on sarcastic irony as a face-saving or indirect criticism device:
Jorgensen (1996), The functions of sarcastic irony in speech, Journal of Pragmatics.