Zack_M_Davis comments on Unnatural Categories Are Optimized for Deception

Zack_M_Davis 14 Jun 2025 1:58 UTC
11 points
1
I think this was excellent work that no one (rounding down) appreciated at the time because I sacrificed readability by optimizing for comprehensiveness. If it helps, I have now composed a Twitter-optimized summary:

Time for a Twitter-optimized capsule 🧵 of my 2021 philosophy of language thesis about why choosing bad definitions is relevantly similar to lying! If you wouldn’t lie, you also shouldn’t say, “it’s not lying; I’m just defining words in a way that I prefer.” ¹⁄₂₄

Some people say: the borders of a category are like the borders of a country—they have consequences, but there’s no sense in which some possible borders can be objectively worse than others. ²⁄₂₄

But category “borders” or “boundaries” are just a visual metaphor corresponding to a kind of probabilistic model. Editing the “boundary” means editing the model’s predictions. There is a sense in which some models can be objectively worse than others! ³⁄₂₄

Imagine having to sort a bunch of blue egg-shaped things (which contain vanadium) & red cubes (that don’t). Technically, you don’t actually need separate “blue egg” & “red cube” categories. You could just build up a joint probability table over all objects and query that. ⁴⁄₂₄

But that’s unwieldy. Thinking about “blue egg” & “red cube” as separate categories and computing the properties of an object conditional on its category is much more efficient. ⁵⁄₂₄
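To put a rough number on "unwieldy", here is a minimal sketch that counts the free parameters of each approach. It assumes binary features and, for the category model, that features are conditionally independent given the category (a naive-Bayes-style simplification that the thread itself does not state). The function names are hypothetical.

```python
def joint_table_params(n_features: int) -> int:
    """Free parameters in a full joint table over n binary features.

    There are 2**n cells, and they must sum to 1, so one is determined
    by the rest.
    """
    return 2 ** n_features - 1


def category_model_params(n_features: int, n_categories: int) -> int:
    """Free parameters when features are predicted from a category label.

    ASSUMPTION (not from the thread): features are conditionally
    independent given the category, so each category needs only one
    probability per binary feature, plus the category prior itself.
    """
    prior_params = n_categories - 1
    conditional_params = n_categories * n_features
    return prior_params + conditional_params


# Ten binary features, two categories ("blue egg" and "red cube").
full_table = joint_table_params(10)
by_category = category_model_params(10, 2)
print(full_table, by_category)
```

Under these assumptions the joint table grows exponentially in the number of features, while the category model grows linearly.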

“Computing the properties of an object conditional on its category” can be visualized as category “boundaries” in a picture. ⁶⁄₂₄

But the picture is an illustration of the math; you can’t change the picture without changing the math. It’s not like national borders at all. The U.S. purchasing Alaska (non-contiguous with the 48 states) wasn’t about editing a probabilistic model. ⁷⁄₂₄

In itself, this doesn’t yet explain what’s wrong with “squiggly”, “gerrymandered” categories. You can still make predictions with squiggly categories. ⁸⁄₂₄

But if approximately-correct answers are at all more useful than totally-wrong answers, squiggly categories are mathematically just worse (going by the mean squared error). If your “blue eggs” category contains some red cubes, you’ll drill for vanadium where there isn’t any. ⁹⁄₂₄
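The mean-squared-error claim can be checked on a toy population. The sketch below is illustrative only: the object counts, the `vanadium` scores of 1.0 and 0.0, and the choice to predict each object by its category's average are assumptions made for the example, not details from the original post.

```python
from collections import defaultdict


def category_mse(objects, label_of):
    """Mean squared error when each object's vanadium content is
    predicted by the average vanadium content of its assigned category."""
    groups = defaultdict(list)
    for obj in objects:
        groups[label_of(obj)].append(obj["vanadium"])
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    squared_errors = [
        (obj["vanadium"] - means[label_of(obj)]) ** 2 for obj in objects
    ]
    return sum(squared_errors) / len(objects)


# Hypothetical population: 100 eggs with vanadium, 100 cubes without.
objects = (
    [{"id": i, "shape": "egg", "vanadium": 1.0} for i in range(100)]
    + [{"id": 100 + i, "shape": "cube", "vanadium": 0.0} for i in range(100)]
)


def natural(obj):
    """Category boundary that follows the actual clusters."""
    return "blue egg" if obj["shape"] == "egg" else "red cube"


def squiggly(obj):
    """Gerrymandered boundary: 20 cubes (ids 100-119) are also
    labeled "blue egg"."""
    if obj["shape"] == "egg" or obj["id"] < 120:
        return "blue egg"
    return "red cube"


natural_mse = category_mse(objects, natural)
squiggly_mse = category_mse(objects, squiggly)
print(natural_mse, squiggly_mse)
```

In this toy setup the natural boundary predicts vanadium perfectly, while the squiggly one carries a nonzero error, all of it coming from the "blue egg" category that now mixes eggs with cubes.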

The best categories are subjective in the sense that they depend on what you’re trying to predict, but that’s not the same thing as the category boundary itself being subjective. Given what you want to predict, the model (and thus the “boundary”) is determined by the data. ¹⁰⁄₂₄

Some people say: okay, but what if I really do have a preference for using a particular squiggly boundary, intrinsically, not in a way that arises from desired predictions and the data distribution? That’s just how my utility function is! What’s irrational about that? ¹¹⁄₂₄

Let’s interrogate this. What would it mean, to have such an exotic utility function? There is a trivial sense in which any pattern of behavior, however bizarre, could be rationalized in terms of preferring to take the actions that I do in the situation that I face. ¹²⁄₂₄

But a theory that explains everything explains nothing. The explanatory value of the “utility function” formalism isn’t that it can justify anything given a choice of “utility”, but that it articulates constraints on coherent behaviors (given, yes, a choice of “utility”). ¹³⁄₂₄

If your gambling behavior violates the independence axiom with respect to money, that doesn’t automatically make you irrational, but it does mean that you’re acting as if you care about something else besides money—that you’ll sacrifice some money for that something else. ¹⁴⁄₂₄

Similarly, if your communication signals aren’t explainable in terms of conveying probabilistic predictions, that does imply that you care about something else besides conveying probabilistic predictions—that you’ll sacrifice clarity (of predictions) for that something else. ¹⁵⁄₂₄

But what might that something else be, concretely? It’s hard to see where a completely arbitrary, hardwired, “just because” preference for using a particular category boundary would come from! Why would that be a thing? Why?? ¹⁶⁄₂₄

A much more plausible reason to sacrifice clarity of predictions is that you don’t want other agents to make accurate predictions. (Because if those others had better models, they’d make decisions that harm your interests.) That’s deception. ¹⁷⁄₂₄

There’s no functional difference between saying “I reserve the right to lie p% of the time about whether something belongs to a category” and adopting a new category system that misclassifies p% of things. The input–output relations are the same. ¹⁸⁄₂₄
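One way to make "the input–output relations are the same" concrete is to write both policies as a channel from an object's true kind to the word the sender utters. The sketch below assumes the redefined category is "eggs plus any object bearing a hidden mark", that a fraction p of objects bear the mark, and that the receiver cannot observe the mark. The mark and all names here are hypothetical illustrations.

```python
from fractions import Fraction

p = Fraction(1, 10)

# Policy 1: ordinary categories, but a cube is called "blegg" with
# probability p (the "reserve the right to lie" policy).
lying_channel = {
    "egg": {"blegg": Fraction(1), "rube": Fraction(0)},
    "cube": {"blegg": p, "rube": 1 - p},
}


def redefined_label(kind, marked):
    """Policy 2: a redefined 'blegg' that covers every egg plus any
    object with the hidden mark, applied with no randomness at all."""
    return "blegg" if kind == "egg" or marked else "rube"


def redefined_channel(p_marked):
    """P(word | true kind) under the redefinition, averaging over the
    hidden mark that the receiver is assumed unable to observe."""
    channel = {}
    for kind in ("egg", "cube"):
        dist = {"blegg": Fraction(0), "rube": Fraction(0)}
        for marked, weight in ((True, p_marked), (False, 1 - p_marked)):
            dist[redefined_label(kind, marked)] += weight
        channel[kind] = dist
    return channel


same_channel = redefined_channel(p) == lying_channel
print(same_channel)
```

The two channels coincide only relative to the receiver's variables (true kind and word). A receiver who can observe the mark, or who cares about it, sees a deterministic rule rather than noise.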

A related reason for unnatural categories: it’s tempting to “wirehead” by choosing a map that looks good, instead of the map that reflects the territory (which might be unpleasant to look at). That’s self-deception. ¹⁹⁄₂₄

If I want to believe I’m pretty & funny, it might be tempting to redefine “pretty” & “funny” such that they include me. But that would just be fooling myself; it doesn’t actually work for making me pretty & funny (with respect to the usual meanings). ²⁰⁄₂₄

Sometimes one thing resembles another in some but not all aspects. This is mimicry. It’s deceptive if the point is for another agent to treat the mimic as the original against the agent’s interests—but it’s not deceptive if the agent really doesn’t care about the difference. ²¹⁄₂₄

If agents sharing a language disagree about which aspects “count”, they’ll fight over the definitions of words: animal advocates would prefer if plant-based meat substitutes counted as “real” meat, to make it hard for carnivores to insist on the dead-animal kind. ²²⁄₂₄

Philosophy itself can’t determine which definition is right (which depends on the empirical merits), but philosophy does clarify what’s happening in this kind of conflict—that departing from the empirical merits extracts a cost in the form of worse predictions. ²³⁄₂₄

Original post: “Unnatural Categories Are Optimized for Deception” https://www.lesswrong.com/posts/onwgTH6n8wxRSo2BJ/unnatural-categories-are-optimized-for-deception END/24
- tslarm 14 Jun 2025 16:41 UTC
  10 points
  0
  There’s no functional difference between saying “I reserve the right to lie p% of the time about whether something belongs to a category” and adopting a new category system that misclassifies p% of things. The input–output relations are the same.
  If I’m honest about the boundaries of my new category system, how is this deceptive? You know that my ‘blegg’ category includes a small number of things that you would prefer to define as rubes, so when I tell you something is a blegg, you know that means it has an X% chance of being a mutually-agreed blegg and a (100−X)% chance of being (in your eyes) a rube with properties that I consider definitive of a blegg. From your perspective, I may be concealing some relevant information, but I’m doing so openly and allowing you to draw correct probabilistic inferences.
  That’s not the same as “I reserve the right to lie p% of the time about whether something belongs to a category”; it’s the same as “I will consistently ‘lie’ about which of these categories some things belong to, because those things have properties that are not part of the usual definitions of the categories but which I consider important, namely <x> or <y>, and I will consistently say that things with <x> belong in category A and things with <y> belong in category B”. Which would be a weird way to put it, because I’m not actually lying if the meaning of my words is clear (albeit not informative in exactly the way you would prefer) and I am neither deceiving you nor intending to deceive you.
  - Zack_M_Davis 8 Jul 2025 4:24 UTC
    2 points
    0
    Excellent question, thanks for commenting! (And for your patience.) The part of the original post that that Tweet summarizes is the set of paragraphs after “Suppose I’m selling you some number of gold and silver bars [...]”. As you observe, whether it’s “lying” to use a category label depends on what the label is construed to “canonically” mean. The idea here is that, as far as signal processing goes, there’s an isomorphism between “lying p% of the time” with respect to the maximally-informative categories, and choosing different categories that conceal information. So if the new categories aren’t deceptive because the receiver knows about them, is lying therefore not deceptive if the receiver already has it “priced in” that the sender lies this-and-such proportion of the time? I discuss this problem further in “Maybe Lying Can’t Exist?!” and “Comment on ‘Deception is Cooperation’”.