Unnatural Categories Are Optimized for Deception

Followup to: Where to Draw the Boundaries?

There is an important difference between having a utility function defined over a statistical model’s performance against specific real-world data (even if another mind with different values would be interested in different data), and having a utility function defined over features of the model itself.

Arbitrariness in the map doesn’t correspond to arbitrariness in the territory. Whatever criterion your brain is using to decide which word you want, is your non-arbitrary reason …

So the one comes back to you and says:

That seems wrong—why wouldn’t I care about the utility of having a particular model? I agree that categories derive much of their usefulness from “carving reality at the joints”—that’s one very important kind of consequence of choosing to draw category boundaries in a particular way. But other consequences might matter too, if we have some moral reason to value drawing our categories a particular way. I don’t see why I shouldn’t be willing to trade off one unit of categorizational nonawkwardness for units of morality, even if trading off a million units of categorizational nonawkwardness for the same units of morality would be bad.

I once read about an analogy between category boundaries and national borders. Imagine a diplomat trying to come up with a proposal for a two-state solution to the Israeli–Palestinian conflict. There’s no such thing as the “correct” border between Israel and Palestine, but there are consequences of choosing one border or another. For example, awarding territory to one side risks angering the other. For another, if the West Bank and Gaza Strip are to be part of Palestine, but Tel-Aviv and the southern city of Eilat are to be part of Israel, then topology forces you to decide which of Israel and Palestine gets to be continuous, and which will be split into two parts, because a “land bridge” between Gaza and the West Bank would separate Tel Aviv and Eilat, and vice versa. Since borders can’t be “true” or “false”, the diplomat’s task is and can only be to weigh these kinds of trade-offs.

Analogously, I think of language, following Eliezer Yudkowsky’s “A Human’s Guide to Words”, as being a human-made project intended to help people understand each other. It draws on the structure of reality, but has many free variables, so that the structure of reality doesn’t constrain it completely. This forces us to make decisions, and since these are not about factual states of the world—what the definition of a word really is, in God’s dictionary—we have nothing to make those decisions on except consequences.

… okay, I think I see the problem. I see how one might have gotten that out of “A Human’s Guide to Words”—if you skipped all the parts with math. I am now prepared to explain exactly what’s wrong here in more detail than my previous attempt: not just that this position is not in harmony with the hidden Bayesian structure of language and cognition, but how the hidden Bayesian structure of language and cognition explains why an intelligent system might find this particular mistake tempting in the first place, and what breaks as a result.

Category “boundaries” are a useful visual metaphor for helping explain the cognitive function of categorization. If you have the visualization but you don’t have the math, you might think you have the freedom to “redraw” the category “boundaries”. Simple, compact boundaries might tend to be more useful, but more complicated boundaries aren’t false and therefore aren’t forbidden if you have some non-epistemic reason to prefer them … right?

Only in the sense that no hypothesis is “false”! Categories, words, correspond to hypotheses—probabilistic models that make predictions. If I see a dolphin in the water, and I say, “Hey, there’s a dolphin!”, and you understand me, that enables you to predict quite a lot about there being this-and-such kind of aquatic mammal with fins, a tail, &c. in the water.

This AI capability of “speech” is not only very powerful; it’s also easy to understand the cause-and-effect evidential entanglement which explains how it works—at least at a very high level.

Photons bounce off the dolphin and hit my eyes. I recognize the photons as forming an image that matches a concept that I associate with the word/​symbol “dolphin” (implementation details omitted). I emit a “dolphin” signal composed of sound waves which hit your eardrum. By a convention that culturally evolved due to our predecessors having a shared interest in communicating with each other, you map the “dolphin” signal to an internal concept that closely resembles the one I associate with that same signal. This works because we happen to live in a world where the distribution of creatures has cluster-structure whereby dolphins have lots of things in common with each other, such that it’s possible to use observations about an entity to infer that it “is a dolphin”, and then use the dolphin concept to make good predictions about aspects of the entity that have not yet been observed; we owe our confidence that we’ve learned “the same” dolphin model to the fact that dolphins actually exist.

But the dolphin concept/​model/​hypothesis is subject to the universal mathematical laws of reasoning under uncertainty. In particular, probability-mass flows between hypotheses: as long as you never assign a probability of zero (which is a log-odds of negative infinity), nothing you believe can ever be definitively (infinitely) “falsified”—it “just” makes quantitatively worse predictions as compared to other hypotheses.
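The point about log-odds can be made concrete with a minimal sketch. The function names `log_odds` and `update` are illustrative choices, not anything from the essay; the only claim being demonstrated is that repeated finite evidence drives a probability toward zero without ever reaching it.

```python
import math

def log_odds(p: float) -> float:
    """Log-odds of probability p, in bits; -inf at p = 0 and +inf at p = 1."""
    if p == 0.0:
        return -math.inf
    if p == 1.0:
        return math.inf
    return math.log2(p / (1.0 - p))

def update(p: float, likelihood_ratio: float) -> float:
    """Posterior after evidence with ratio P(evidence | H) / P(evidence | not H).

    Assumes 0 <= p < 1 so that the prior odds are finite.
    """
    odds = p / (1.0 - p) * likelihood_ratio
    return odds / (1.0 + odds)

p = 0.5
for _ in range(10):
    p = update(p, 0.1)  # each observation favors the alternative 10:1

print(p)              # very small, but still strictly positive
print(log_odds(p))    # a finite negative number
print(log_odds(0.0))  # negative infinity: no finite evidence gets you here
```

Each update adds a finite amount to the log-odds, which is why only a prior of exactly zero (or one) is beyond recovery.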

Because category “boundaries” are merely a visualization for a probabilistic model that makes predictions about the real world, you can’t “redraw the boundaries” associated with a communication signal without messing with the model that generates them, which means messing with your predictions about the real world.

Might there be some non-epistemic reason for an agent to prefer a model that makes worse predictions? Sure! Correct maps are useful for steering reality into configurations ranked higher in your preference ordering—but causing a different agent to have incorrect maps might make them mis-navigate reality in a way that benefits you! We call this deception.

In a related phenomenon, a poorly-designed agent might get confused and end up manipulating its own beliefs: optimizing its map to inaccurately portray a high-value territory (rather than optimizing the territory to be high-value by using a map that reflects the territory), a kind of self-deception. We call this wireheading.

The laws of probability and information theory allow us to calculate how information can be efficiently encoded and transmitted from one place to another. Given some distribution of random variables, and some specification of what information about those variables you want to transmit, some encodings—some ways of “drawing” category “boundaries”—quantitatively perform better than others. Agents that want to communicate with each other will tend to invent or discover conventions that efficiently encode the information they’re trying to communicate. Agents that communicate in ways that systematically depart from efficient encodings are better modeled as trying to deceive each other or wirehead themselves.


Let’s walk through a simple example. Imagine that you have a peculiar job in a peculiar factory: specifically, you’re a machine-learning engineer tasked with automating away the jobs of humans who sort objects from a mysterious conveyor belt.

Another engineer has already written a system that processes camera and sensor data about the objects into more convenient “features”: color (measured on an eight-point blueness scale), shape (measured on an eight-point “eggness” scale), and vanadium content (a boolean Yes or No). Your task is to further process this information into a format suitable for giving commands to other systems—for example, the robot arm that will physically move the objects into appropriate bins.

The feature data consists of the blueness–eggness–vanadium-content joint distribution given by this 128-entry table:

[Table image: blueness–eggness–vanadium joint distribution]

This seems like … not the most useful representation? The data is all there, so in principle, you could code whatever you needed to do based off the full table, but it seems like it would be an unmaintainable mess: you’d sooner resign than write a 128-case switch statement. Furthermore, when the system is deployed, you hope to typically be able to give the binning robot messages based on only the color and shape observations, because the Sorting Scanner that the vanadium readings come from is expensive to run. You could just do a Bayesian update on the entire joint distribution, of course, but it seems like it should be possible to be more efficient by exploiting regularities in the data, not entirely unlike how your colleague’s system has already made your job much simpler by giving you blueness and eggness feature scores rather than raw camera data. Eyeballing the table, you notice it seems to have a lot of redundancy: most of the probability-mass is concentrated in two regions where the blueness and eggness scores are either both high or both low—and vanadium is only found when both blueness and eggness are high.

O tragedy O the stars! If only there were some more convenient and flexible way to represent this knowledge—some kind of deep structural insight to rescue you from this cruel predicament!

… alright, dear reader—I shouldn’t patronize. You already know how this story ends. The distribution factorizes!

(The distribution in this made-up toy example factorizes exactly, but in a messy real-world application, you might have a spectrum of approximate models to choose from.)

We can simplify our representation of our observations by using a naïve Bayes model, a “star-shaped” Bayesian network where a central “category” node is posited to underlie all of our observations: we believe that each object either “is a blegg” (and therefore contains vanadium and has high blueness and eggness scores) with probability 0.48, “is a rube” (and therefore has no vanadium and low blueness and eggness scores) with probability 0.48, or belongs to a catch-all “other”/​error class with probability 0.04. (Maybe the camera is buggy sometimes, or maybe there are some other random objects mixed in with the rubes and bleggs?)

[Figure: factorized object distribution]

The full joint distribution had 127 degrees of freedom (a table of $8 \cdot 8 \cdot 2 = 128$ separate probabilities, constrained to add up to 1), whereas the naïve-Bayes representation only needs 57 parameters (3 prior probabilities for the categories, plus $3 \cdot 8$-, $3 \cdot 8$-, and $3 \cdot 2$-entry conditional probability tables for each of the features). The advantage would be much larger for more complicated problems: the joint distribution table grows exponentially with more features, quickly becoming infeasible to store and represent, let alone learn.
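The counting argument can be checked with a short sketch. The helper names are hypothetical, and the naïve-Bayes figure counts stored table entries, which is how the total of 57 comes out; counting only free parameters after normalization would give a smaller number.

```python
from math import prod

def joint_table_dof(feature_sizes):
    """Degrees of freedom of a full joint table: all cells, minus one
    because the probabilities must sum to 1."""
    return prod(feature_sizes) - 1

def naive_bayes_entries(feature_sizes, n_categories):
    """Stored entries: one prior per category, plus one conditional
    table of (categories x values) entries per feature."""
    return n_categories + sum(n_categories * k for k in feature_sizes)

sizes = [8, 8, 2]  # blueness, eggness, vanadium
print(joint_table_dof(sizes))         # 127
print(naive_bayes_entries(sizes, 3))  # 57

# Hypothetical larger problem: ten eight-valued features, three categories.
big = [8] * 10
print(joint_table_dof(big), naive_bayes_entries(big, 3))
```

The last line illustrates the exponential-versus-linear growth: the joint table needs over a billion entries while the factorized model needs a few hundred.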

It must be stressed that our “categories” here are a specific mathematical model that makes specific (probabilistic) predictions. Suppose we see a black-and-white photo of an egg-shaped object: specifically, one with an eggness score of 7. Given that observation of $E = 7$ (writing $E$ for the eggness score), we can update our probabilities of category-membership.

We think the egg-shaped object is almost certainly a blegg (specifically, with probability 0.96), even if the black-and-white photo doesn’t directly tell us how blue it is, because

$$P(\text{blegg} \mid E = 7) = \frac{P(E = 7 \mid \text{blegg})\,P(\text{blegg})}{\sum_{c \in \{\text{blegg},\,\text{rube},\,\text{other}\}} P(E = 7 \mid c)\,P(c)} = 0.96.$$

We can then use our updated beliefs about category membership (0.96 blegg/​0 rube/​0.04 unknown, as contrasted to the 0.48/​0.48/​0.04 prior) to get our updated posterior distribution on the 0–7 blueness score (0.005/​0.005/​0.005/​0.005/​0.005/​0.245/​0.485/​0.245—left as an exercise for the reader).
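For readers who want to attempt the exercise numerically, here is a minimal sketch. The conditional probability tables below are an assumption: the original tables are not reproduced in this text, so these values were chosen to be consistent with the numbers that are quoted (the 0.48/0.48/0.04 prior, the 0.96 posterior, and the posterior blueness distribution). Treat them as one model that fits, not necessarily the original.

```python
# ASSUMED tables, inferred from the figures quoted in the text.
PRIOR = {"blegg": 0.48, "rube": 0.48, "other": 0.04}
HIGH = [0, 0, 0, 0, 0, 0.25, 0.5, 0.25]  # scores concentrated on 5..7
LOW = [0.25, 0.5, 0.25, 0, 0, 0, 0, 0]   # scores concentrated on 0..2
UNIFORM = [1 / 8] * 8                    # catch-all class: no structure
EGGNESS = {"blegg": HIGH, "rube": LOW, "other": UNIFORM}
BLUENESS = {"blegg": HIGH, "rube": LOW, "other": UNIFORM}

def posterior_category(eggness):
    """Bayes' theorem: P(category | eggness score)."""
    joint = {c: PRIOR[c] * EGGNESS[c][eggness] for c in PRIOR}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

def predictive_blueness(category_probs):
    """P(blueness) obtained by mixing each category's blueness table.

    Relies on the naive Bayes assumption that blueness and eggness are
    independent once the category is known."""
    return [
        sum(category_probs[c] * BLUENESS[c][b] for c in category_probs)
        for b in range(8)
    ]

post = posterior_category(7)
blue = predictive_blueness(post)
print({c: round(p, 3) for c, p in post.items()})
print([round(p, 3) for p in blue])
```

Under these assumed tables, the update routes all information from eggness to blueness through the category node, which is exactly what the star-shaped network licenses.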


In addition to categories facilitating efficient probabilistic inference within the system that you’re currently programming, labels for categories turn out to be useful for communicating with other systems. The robot arm in the Sorting room puts bleggs in a blegg bin, which gets taken to a room elsewhere in the factory where there’s sophisticated vanadium-ore-processing machinery that has to handle both bleggs and gretrahedrons.

But suppose the binning arm doesn’t need to know about the blueness and eggness scores: it can close its claws around rubes and bleggs alike, and you only need to program it to pick up an object from a certain spot on the conveyor belt and place it into the correct bin. However, the vanadium-ore-processing machine does need to do further information processing before it can operate on an object—perhaps it needs to vary its drill speed in proportion to the density of a particular blegg’s flexible outer material (which it can estimate based on how brightly the blegg glows in the dark), but it uses a different drilling pattern for gretrahedrons.

If you need to send commands to both the binning arm and the ore-processing machine, it’s a more efficient communication protocol to just be able to send the 28-byte JSON payload {"object_category": "BLEGG"} and let the other machines do their work using their own models of bleggs, rather than having to send over the raw camera data plus the binary code of the Bayesian network and feature extractors that you initially used to identify bleggs. Intelligence is prediction is compression: our ability to find an encoding that compresses the length of the message needed to convey information about the objects is fundamental to our having learned something about the distribution of objects.

The {"object_category": "BLEGG"} message is a useful shorthand for “linking up” the models between different machines. Different machines might not use the same model: the classifier system uses blueness and eggness scores to identify bleggs, but the ore-processing machine, having been told that an object is a blegg, can take its approximate blueness and eggness for granted and only needs to reason about its luminescence and vanadium content.

But this trick of using a signal to correlate the models between different machines only works because and insofar as both models are pointing to the same cluster-structure in reality. If the model in the classifier system doesn’t meaningfully match the model in the ore-processing system—if the classifier code sends the {"object_category": "BLEGG"} message given an object with blueness score between 5 and 7, but the ore-processor, upon receiving the {"object_category": "BLEGG"} message, positions its drills in the expectation of processing an object with an eggness score between 0 and 2—then the factory doesn’t work.
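The messaging protocol described in this section can be sketched as follows. Everything here apart from the JSON payload itself is hypothetical: the thresholds in `classify`, the `RECEIVER_EXPECTS` table, and the function names are illustrative stand-ins for whatever models the two machines actually use.

```python
import json

def classify(blueness: int, eggness: int) -> str:
    """Hypothetical sender-side rule using compact categories."""
    if blueness >= 5 and eggness >= 5:
        return "BLEGG"
    if blueness <= 2 and eggness <= 2:
        return "RUBE"
    return "OTHER"

def encode(category: str) -> bytes:
    """Serialize the category label as the JSON payload from the text."""
    return json.dumps({"object_category": category}).encode("utf-8")

# Hypothetical receiver-side model: the eggness range it expects per label.
RECEIVER_EXPECTS = {"BLEGG": range(5, 8), "RUBE": range(0, 3)}

def receiver_model_matches(payload: bytes, actual_eggness: int) -> bool:
    """Does the object fall where the receiver's model of the label says?"""
    category = json.loads(payload)["object_category"]
    expected = RECEIVER_EXPECTS.get(category)
    return expected is not None and actual_eggness in expected

payload = encode(classify(blueness=6, eggness=6))
print(payload, len(payload))               # the 28-byte message
print(receiver_model_matches(payload, 6))  # models agree
print(receiver_model_matches(payload, 1))  # the failure mode described above
```

The short label only carries information because both sides hold models of the same cluster; the final call shows what the receiver sees when they do not.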


As a human learning math, it’s helpful to examine multiple representations of the same mathematical object. We’ve already seen our blueness–eggness–vanadium model represented as a table, and factorized into a graphical model. We’ve also done some algebraic calculations with it. But we can also visualize it: the set of camera observations that the model classifies as a blegg with high probability can be thought of as an area with a boundary in two-dimensional blueness–eggness space.

(“With high probability” because our catch-all “other”/error category can also generate examples with high blueness and eggness scores; we can’t say things like “Everything inside the boundary in the diagram is a blegg” when we’re talking about a formal model where some of the categories generate overlapping observations in whatever subspace the diagram is depicting.)

If you were trying to teach someone about the hidden Bayesian structure of language and cognition, but thought your audience was too stupid or lazy to understand the actual math, you might be tempted to skip the part about factorizing a joint distribution into a star-shaped Bayesian network and just talk about “drawing” “boundaries” in configuration space for human convenience, perhaps with a hokey metaphor about national borders. Then the audience might walk away with the idea that there’s no reason not to replace the old blegg concept and its boring compact boundary, with a new blegg* concept that has an exciting squiggly border.

Alaska isn’t even contiguous with the rest of the United States. If that’s okay, why can’t the borders of bleggness be a little squiggly?

Because the “national borders” metaphor is just a metaphor. It immediately breaks down as soon as you try to do any calculations.

When we say that the United States purchased Alaska from the Russian Empire, that means that this-and-such physical area on the Earth’s surface went from being the territory of the Russian government, to being territory of the United States government, where land being the “territory of” a “government” is a complicated idea that has something to do with Schelling points over who gives orders to policemen and soldiers in that area.

When you reprogram your machine-learning system to send an {"object_category": "BLEGG"} message when it sees an object with an eggness score of 2 and a blueness score of 1, then your vanadium-ore-processing machine wears down its drill bits trying to process a rube.

Other than the fact that some aspects of both of these situations can be usefully visualized as changes to a two-dimensional diagram depicting an area with a boundary, what do these situations have to do with each other? They don’t. Countries aren’t Bayesian networks. They just aren’t. When we depict a country on a map, we’re not talking about a cognitive system that can use observations of latitude to estimate probabilities of country-membership and then use that distribution on country-membership to get an updated probability distribution on longitude. (I mean, given a world map, you could program such a thing, but it seems kind of useless—it’s not clear why anyone would want that particular program.) Why would you expect to understand an AI-theory concept by telling a story about national borders?


So, that’s what’s wrong with the national-borders metaphor. But we haven’t yet really explained the problem with “unnatural” categories—those that you would visualize as a squiggly, “gerrymandered” boundary. The squiggly blegg* boundary doesn’t have the nice property of corresponding to the category labels in our nice factorized naïve Bayes model, but it still contains information. You can still do a Bayesian update on being told that an object lies within a squiggly boundary in configuration space. If that update eliminates half of your probability-mass, that’s one information-theoretic bit, no matter how the category is shaped in Thingspace.

If you only care about how much probability you assign to the exact answer, then a bit is a bit. But if an approximate answer is approximately as good—if your answerspace has a metric on it, so that “approximate” can mean something—then some bits can be more valuable than others.

Suppose some random variable $X$ is uniformly distributed on the set $\{1, 2, 3, 4, 5, 6, 7, 8\}$. You have the option of being told either whether an observation $x$ sampled from $X$ is even or odd, or whether $x$ is greater or less than 4.5. Either way, you eliminate half of your hypotheses: the entropy of your probability distribution goes from $\log_2 8 = 3$ bits to $\log_2 4 = 2$ bits. Either way, you’ve learned 1 bit.
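The entropy bookkeeping can be verified with a short sketch, assuming the variable is uniform on the integers 1 through 8 (the range implied by the “1–4 or 5–8” split). The function names are illustrative.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def entropy_after_learning(values, label):
    """Expected entropy of x (uniform on values) once label(x) is known."""
    groups = {}
    for x in values:
        groups.setdefault(label(x), []).append(x)
    n = len(values)
    return sum(
        len(g) / n * entropy_bits([1 / len(g)] * len(g))
        for g in groups.values()
    )

values = list(range(1, 9))
before = entropy_bits([1 / len(values)] * len(values))
parity = entropy_after_learning(values, lambda x: x % 2 == 0)
halves = entropy_after_learning(values, lambda x: x > 4.5)
print(before, parity, halves)  # 3.0 2.0 2.0
```

Both partitions remove exactly one bit, so entropy alone cannot distinguish between them.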

Still, if you have to make a decision that depends on “how big” $x$ is, it seems like the “1–4 or 5–8” category system is going to be more useful than the “even/odd” category system, even though they both provide the same amount of information about the exact answer. If you learn that $x < 4.5$, then you know that $x$ is “small”, but if you learn that $x$ is odd, you haven’t learned much about how big it is: it could be 1, but it could just as well be 7.

To formalize this, let’s measure how “good” a category is using the expected squared error. “Error” is how much a prediction is wrong by: if you guessed $x$ was 2, but it was actually 5, your error would be $5 - 2 = 3$, and your squared error would be the square of that, $3^2 = 9$. The expected squared error of a probability distribution is, on average, the square of how much your guess about a sample from that distribution will be wrong. (The squared error has nicer mathematical properties than the absolute error.)

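For our example of $x$ sampled from $X$ uniformly distributed on $\{1, 2, \ldots, 8\}$, your best-guess estimate of $x$ is going to be the expected value

$$\mathbb{E}[X] = \sum_{i=1}^{8} \frac{1}{8} \cdot i = 4.5$$

And the initial expected squared error is

$$\mathbb{E}\left[(X - 4.5)^2\right] = \sum_{i=1}^{8} \frac{1}{8} (i - 4.5)^2 = 5.25$$

Suppose you then learn whether $x$ is even or odd.

With probability 0.5, you learn that $x$ is even. In that case, your new estimate taking that into account would be

$$\mathbb{E}[X \mid X \text{ is even}] = \frac{2 + 4 + 6 + 8}{4} = 5$$

and your new expected squared error (in the “even” possible world) would be

$$\mathbb{E}\left[(X - 5)^2 \mid X \text{ is even}\right] = \frac{(2 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 + (8 - 5)^2}{4} = 5$$

With probability 0.5, you learn that $x$ is odd. Similar calculations (left as an exercise) also give a new expected squared error of 5 in the “odd” possible world. Averaging over both cases (trivially, $0.5 \cdot 5 + 0.5 \cdot 5 = 5$), learning whether $x$ is even or odd only brought our expected squared error down from 5.25 to 5, barely changing at all.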

In contrast, if you learn whether $x$ is 1–4 or 5–8, your expected squared error plummets to 1.25. (Exercise.) By being compact, the “1–4 or 5–8” category system is much more useful for getting close to the right answer than the “even/odd” category system.
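The exercises in this passage can be checked with a minimal sketch, again assuming a uniform distribution on the integers 1 through 8. The function names are illustrative.

```python
def expected_squared_error(values):
    """Mean squared distance from the best guess (the mean),
    for a uniform distribution over the given values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def error_after_learning(values, label):
    """Expected squared error once label(x) is known, averaged over labels."""
    groups = {}
    for x in values:
        groups.setdefault(label(x), []).append(x)
    n = len(values)
    return sum(len(g) / n * expected_squared_error(g) for g in groups.values())

values = list(range(1, 9))
initial = expected_squared_error(values)
parity = error_after_learning(values, lambda x: x % 2 == 0)
halves = error_after_learning(values, lambda x: x > 4.5)
print(initial, parity, halves)  # 5.25 5.0 1.25
```

Both labels carry one bit, but only the compact “1–4 or 5–8” split tells you much about where the value lies.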

The same goes for natural categories versus squiggly category “boundaries” in configuration space; we just need to supply some metric to define what “close” means.

For our blueness–eggness–vanadium distribution, suppose we use the Euclidean distance on blueness-score ✕ eggness-score ✕ 1-if-vanadium-present-else-0. (So, for example, taking the typical blegg to sit at (6, 6, 1) and the typical rube at (1, 1, 0), the “distance” between them is $\sqrt{5^2 + 5^2 + 1^2} = \sqrt{51} \approx 7.14$ under this metric.)

Then our expected squared error before being told anything about an object is about 13.63. On being told whether an object is a blegg, rube, or other (according to the categories in our nice factorized naïve Bayes model), our expected squared error plummets to 1.38.

But suppose that, instead of our nice factorized naïve Bayes model, we use a category system based on drawing squiggly “boundaries” in configuration space: everything inside the blegg* boundary in the diagram is a blegg*, everything within the rube* boundary is a rube*, and anything outside belongs to a catch-all “other*” category.

On learning whether an object is a blegg*, rube*, or other*, our expected squared error only goes down to about 4.12.[1]

In this sense, the gerrymandered blegg* concept is quantitatively less informative than the original, compact blegg concept. The metric we assigned to blueness–eggness–vanadium space was our choice, and could depend on our values: for example, if we simply don’t care about predicting how blue an object is, we could disregard the blueness score and only define a concept on the eggness–vanadium subspace (in which case our initial expected squared error is about 6.94, plummets to 0.69 given knowledge of blegg/​rube/​other category-membership, but only goes down to about 1.81 given knowledge of the gerrymandered blegg*/​rube*/​other* category). Or if we don’t care about predicting blueness very much, we could calculate our error score with respect to a metric that gave blueness very little weight. (Exercise.)
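The figures for the natural categories can be reproduced with a sketch like the one below. The conditional tables are an assumption inferred from numbers quoted in the text (including the further guess that objects in the “other” class never contain vanadium); under them the computation gives about 13.63 and 1.38 for the full metric, and about 6.94 and 0.69 when blueness is given zero weight. The gerrymandered blegg*/rube* boundaries live in a diagram that is not reproduced here, so those figures are not recomputed.

```python
from itertools import product

# ASSUMED model, chosen to be consistent with the figures in the text.
PRIOR = {"blegg": 0.48, "rube": 0.48, "other": 0.04}
HIGH = [0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
LOW = [0.25, 0.5, 0.25, 0, 0, 0, 0, 0]
UNIFORM = [1 / 8] * 8
SCORES = {"blegg": HIGH, "rube": LOW, "other": UNIFORM}  # blueness and eggness
VANADIUM = {"blegg": [0, 1], "rube": [1, 0], "other": [1, 0]}  # [P(no), P(yes)]

def category_points(c):
    """Yield (probability, (blueness, eggness, vanadium)) within category c."""
    for b, e, v in product(range(8), range(8), range(2)):
        p = SCORES[c][b] * SCORES[c][e] * VANADIUM[c][v]
        if p > 0:
            yield p, (b, e, v)

def spread(weighted_points, weights):
    """Expected weighted squared distance from the mean point."""
    pts = list(weighted_points)
    total = sum(p for p, _ in pts)
    mean = [sum(p * x[i] for p, x in pts) / total for i in range(3)]
    return sum(
        p * sum(w * (x[i] - mean[i]) ** 2 for i, w in enumerate(weights))
        for p, x in pts
    ) / total

def error_before(weights):
    """Expected squared error knowing nothing about the object."""
    pts = [(PRIOR[c] * p, x) for c in PRIOR for p, x in category_points(c)]
    return spread(pts, weights)

def error_after(weights):
    """Expected squared error once blegg/rube/other membership is known."""
    return sum(PRIOR[c] * spread(category_points(c), weights) for c in PRIOR)

full = (1, 1, 1)         # care about all three features equally
no_blueness = (0, 1, 1)  # disregard blueness entirely
print(round(error_before(full), 2), round(error_after(full), 2))
print(round(error_before(no_blueness), 2), round(error_after(no_blueness), 2))
```

Changing the `weights` tuple is one way to attempt the final exercise: a small but nonzero first weight expresses caring about blueness only a little.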

But given a metric on the variables that you care about predicting and using to inform predictions, which categories are cognitively useful depends on the distribution of data in the world. You can’t define a word any way you want.


The dependence on a choice of metric on configuration space—and really, a choice of the space—gives a sense in which optimal categories are value-laden, but it’s a specific kind of lawful dependence between your values and the distribution of data in the world, not an atomic preference for using a particular encoding for its own sake.

The cognitive function of categorization is to group similar things together so that we can make similar decisions about them. A function measuring the extent to which things are “similar” has to take the things as input, but the extent to which things are decision-relevantly similar also depends on what you’re trying to accomplish with your decisions, and that can be algorithmically complex. It might not be just a matter of only looking at some decision-relevant subspace of a natural, “obvious” configuration space that’s available to all possible minds (like not caring what color your toothbrush handle is—um, if we pretend that all possible minds had human-like color vision); the dimensions of the space you do your similarity-clustering in might themselves be complicated features (in the sense of machine learning) of which agents with different values would have no reason to logically pinpoint that particular criterion by which things may be judged. How you should define words depends on what you want, but that’s not the same as defining words any way you want.

For example, poison isn’t a natural category to a generic mind studying chemistry: we group cyanide and hemlock together as poison because we value human health, and so we want to have a category for scary chemicals that disrupt human metabolism, causing death or serious illness. But this determination depends on the intricate details of human biochemistry. (The theobromine in chocolate is okay for humans at typical doses, but potentially fatal to dogs, which are actually pretty close to us in animalspace.) The compact category “boundary” that minimizes predictive error on human-healthspace, corresponds to a squiggly “boundary” in the chemicalspace you would be looking at if you’ve never seen a human and just want to make predictions about the chemicals themselves.

Or tiny molecular smileyfaces and real human smiles might be grouped together as similar as far as an image-classifier’s curve detector is concerned, even if they’re not similar as far as the abstracted idealized dynamic of human morality is concerned.

The technical sense in which optimal categories can be value-laden doesn’t alter the basic morals of our basic Bayesian philosophy of language. Your values can give you a particular configuration space and a metric on the space, but given that, sane agents want to “carve it at the joints” in order to get a communication system that minimizes predictive error. If you’re trying to find an efficient encoding of your observations, there’s no reason to want squiggly, gerrymandered categories in the decision-relevant space.


The one replies:

You’re still not addressing my crux! I don’t doubt what you say about minimizing prediction error with respect to some squared metric thingy. But what if that’s not what I care about? My utility function assigns high value to using the squiggly blegg* category boundary—such that the utility of using my preferred category outweighs the disutility of making less accurate predictions. You can define a word any way you want—if you’re willing to pay the costs.

So, what, you just intrinsically assign high utility to using the same communication signal to encode eggness-2/​blueness-1 observations as eggness-6/​blueness-6 observations, given the joint distribution specified in my story problem about sorting objects in a factory? Really?

“… yes!”

Okay, but where would that kind of exotic utility function come from? How would it arise naturally in an intelligent system?

There’s a trivial sense in which you can interpret any action taken by an agent as being taken because the agent values taking that action. This theory is compatible with all possible behaviors and therefore explains nothing.

The value of decision-theoretic utility functions isn’t that “Because utility!” serves as an all-purpose excuse for any possible behavior. It’s that simple coherence desiderata imply that an agent’s behavior should be describable as maximizing expected utility for some utility function—with corresponding constraints on the shape of that behavior.

Situations like the Allais paradox illustrate what these constraints look like. Consider an AI faced with playing the following game. There’s a switch that can be turned On or Off, that starts out in the Off position. At midnight, a coin is flipped. If the coin comes up Tails, the game ends. If the coin comes up Heads, then at a quarter past midnight, if the switch is Off, then the AI gets paid $100, and if the switch is On, a six-sided die is rolled, and the AI gets paid $110 if the die doesn’t come up 6.

Suppose that, before midnight, the AI is willing to pay a dollar to flip the switch On (as if it thought that winning $110 with a probability of 5/12 is better than winning $100 with a probability of 1/2). Suppose the coin comes up Heads, and the AI is then willing to pay another dollar to flip the switch Off again (as if it thought that $100 with certainty is better than $110 with probability 5/6). Then the AI is two dollars poorer in exchange for the switch being in the same position it started in.

These gambling preferences violate the independence axiom of the von Neumann–Morgenstern utility theorem. You can’t have a utility function $U$ for which

$$\tfrac{5}{12}\,U(\$110) > \tfrac{1}{2}\,U(\$100)$$

and

$$\tfrac{5}{6}\,U(\$110) < U(\$100)$$

because the sides of the second inequality are just those of the first multiplied by two, and multiplying by two should preserve the direction of inequality.
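A small sketch can make the incoherence tangible. It assumes the utility of winning nothing is zero, tries a few arbitrary candidate utility functions (chosen only for illustration), and then tallies the money pump from the story. The function names are hypothetical.

```python
import math

def prefers_on_before_midnight(u):
    """True if a 5/12 chance of $110 beats a 1/2 chance of $100 under u."""
    return (5 / 12) * u(110) > (1 / 2) * u(100)

def prefers_off_after_heads(u):
    """True if a certain $100 beats a 5/6 chance of $110 under u."""
    return u(100) > (5 / 6) * u(110)

# Illustrative candidates, each with u(0) = 0.
CANDIDATES = {
    "linear": lambda x: x,
    "sqrt": math.sqrt,
    "log": lambda x: math.log(1 + x),
}

for name, u in CANDIDATES.items():
    both = prefers_on_before_midnight(u) and prefers_off_after_heads(u)
    print(name, "exhibits both preferences:", both)

def money_pump(price_per_flip=1):
    """Pay to flip On before midnight, then pay to flip Off after Heads."""
    switch, wealth = "Off", 0
    switch, wealth = "On", wealth - price_per_flip
    switch, wealth = "Off", wealth - price_per_flip
    return switch, wealth

print(money_pump())  # switch back where it started, two dollars poorer
```

No candidate shows both preferences, because the second comparison is just the first with both sides doubled; trying other functions is a check, not a proof.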

Having shown this, can we say that an AI with such behavior is “irrational”? But what does that even mean? If, for some reason, you specifically programmed the AI to prefer options it considers “certain”, or to want switches to be “On” before midnight but “Off” after midnight, then it would be functioning as designed.

What we can say about such an AI, is that it doesn’t have a utility function in terms of money, and is therefore not coherently optimizing for acquiring money. Recall that we say that a system is an optimizer if it systematically steers the future into configurations that rank higher with respect to some preference ordering. This helps us make predictions about what effects the system has, without having to model the details of how it brings those effects about. A well-designed agent that was optimizing for acquiring money would be expected to obey the independence axiom.

If the AI playing this game isn’t coherently optimizing for acquiring money, what is it optimizing for? To tell, we’d need to observe its behavior in different environments and see how it responds to perturbations. If it is trying to acquire money but is just biased to prefer certainty (in violation of the von Neumann–Morgenstern axioms), then we’d expect it to make choices that result in money but continue to exhibit Allais-like glitches around gambles involving probabilities close to 1. If it just likes switches to be off after midnight, then we’d expect it to turn switches off at that time even if there’s no gambling game going on.

This methodology for attributing goals to an agent—consider it to be “optimizing for” outcomes that it systematically achieves across a variety of environments—applies to the behavior of sending communication signals, just as it does to the behavior of flipping switches.

Back to the factory. Our classifier system sends a {"object_category": "BLEGG"} message when it gets feature data corresponding to the compact blegg concept. This behavior is optimized for sending messages that allow other systems to minimize the expected squared error of their predictions of objects with respect to our standard metric on blueness–eggness–vanadium space. We don’t intrinsically “assign utility” to using that particular category system; the category is the solution to an optimization problem about how to efficiently get blueness–eggness–vanadium information from one place to another.

A system that sends a {"object_category": "BLEGG"} message when it gets camera data corresponding to the gerrymandered blegg* concept would be optimized for … what? If you don’t intrinsically assign utility to using that particular category system, then why would you program the system that way? What could possibly be the problem for which the gerrymandered category is an optimized solution?

Well. Suppose that, besides your dayjob as a machine-learning engineer, you also happen to own a side interest in the firm that supplies bleggs and rubes to this very factory. And suppose that vanadium fetches higher market prices than palladium, such that the factory is to pay the supplier $2 per blegg but only $1 per rube—and that the accounts-payable records are to be compiled based on how much the classifier you’re currently programming sends {"object_category": "BLEGG"} and {"object_category": "RUBE"} messages, not how much metal actually gets harvested.

You can’t help but notice that you stand to make more money if the system you’re programming sends BLEGG messages more often. You can’t just make it send BLEGG messages all the time—someone would notice and you’d get fired. But the ore-processing room can cope with a few suboptimally-sorted objects. Surely it’s no big deal if you just … adjusted the category boundary of BLEGG-ness a bit?

We saw earlier that the blegg concept does better than the blegg* concept with respect to mean squared error (given a metric on the feature space).

That’s not the only possible scoring function with which one could formalize how “good” a category system is. Suppose that instead we score our category system by which one best minimizes the expected squared error minus supplier revenue in cents. With respect to this criterion, accurate predictions are still good, but supplier revenue is also good.

Learning whether an object is a blegg, rube, or other (according to the “natural” categories in our naïve Bayes model) yields a squared-error-minus-revenue score of about −142.62. (Don’t ask me what the units are on this.) But learning whether an object is a blegg*, rube*, or other* yields a squared-error-minus-revenue score of about −151.57, which is lower (which is better, because we formulated this as a minimization problem). So with respect to that scoring function, the blegg* category “boundary” is preferable.
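To make the comparison concrete, here is a toy sketch in Python. It does not reproduce the figures above, which come from the naïve Bayes model; the one-dimensional `vanadium` feature, the six data points, and the two threshold rules are all invented for illustration. Only the prices ($2 per blegg and $1 per rube, counted in cents) come from the scenario.

```python
# Toy illustration only: the data and thresholds are hypothetical.
PRICE_CENTS = {"BLEGG": 200, "RUBE": 100}  # $2 per blegg, $1 per rube


def mean_squared_error(xs, labels):
    """Error of a receiver who predicts each object's feature value
    as the mean of the category it was labeled with."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    return sum((x - means[label]) ** 2 for x, label in zip(xs, labels)) / len(xs)


def mean_revenue_cents(labels):
    return sum(PRICE_CENTS[label] for label in labels) / len(labels)


def score(xs, labels):
    # Squared error minus supplier revenue; lower is better.
    return mean_squared_error(xs, labels) - mean_revenue_cents(labels)


def classify(xs, threshold):
    return ["BLEGG" if x >= threshold else "RUBE" for x in xs]


vanadium = [1, 2, 3, 7, 8, 9]
natural = classify(vanadium, threshold=5)        # splits the two clusters
gerrymandered = classify(vanadium, threshold=3)  # pulls one rube into BLEGG

for name, labels in [("natural", natural), ("gerrymandered", gerrymandered)]:
    print(name, round(mean_squared_error(vanadium, labels), 3),
          round(score(vanadium, labels), 3))
```

In this toy setup the gerrymandered labeling has the worse squared error but the better combined score, which is the same pattern as the blegg versus blegg* comparison.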


The one says:

But now it sounds like you’re agreeing with me! The compact blegg category serves the factory owner’s goals better, which you formalized in terms of minimizing average squared error. The squiggly blegg* boundary makes the factory perform less well, but it serves the moonlighting engineer’s goals better, which you formalized in terms of minimizing squared error minus supplier revenue. There’s no rule of rationality against the engineer programming the system using the blegg* category boundary if it suits their goals better.

Only in the sense that there’s no rule of rationality against lying! Suppose I’m selling you some number of gold and silver bars, but you can’t examine the metal yourself until later; you can only hope that the receipt I give you is accurate. Consider the following two scenarios.

In the first scenario, I lie: the receipt says I delivered 60 gold bars and 20 silver bars, but I actually delivered 40 gold bars and 40 silver bars. You live in a low-trust world where lying is very common and contract enforcement isn’t really a thing: a third of the time an object is claimed to be gold, it turns out to be silver. So when you discover the fraud, you feel disappointed but not surprised: you would have preferred to get what you paid for, but you can’t say you didn’t anticipate it.

In the second scenario, I tell the truth—with respect to a category system that suits my goals. The receipt says I delivered 60 gold bars and 20 silver bars—and I did. It’s just that what I prefer to call “gold bars”, you prefer to call “gold bars, or silver bars with odd serial numbers”, and what I call “silver bars”, you call “silver bars with even serial numbers”. You know this, so when you examine the actual contents of the delivery, you feel disappointed but not surprised: you would have preferred to transact under your definitions of ‘gold’ and ‘silver’, but you can’t say you didn’t anticipate it.

We might question whether these are two different scenarios, or two descriptions of the same scenario: the same physical receipt, the same physical metal, the same buyer anticipations about the metal conditional on observing the receipt. If we just pay attention to the evidential entanglements instead of being confused by words, then there’s no functional difference between saying “I reserve the right to lie p% of the time about whether something belongs to category C”, and adopting a new, less-accurate category system that misclassifies p% of instances with respect to the old system.
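The equivalence can be checked by counting. The sketch below is a hypothetical setup: it assumes the 40 silver bars carry serial numbers 1 through 40 (so exactly 20 are odd), and it arbitrarily has the lying seller mislabel the first 20 silver bars. Under those assumptions, both scenarios produce the same receipt and the same probability that a bar labeled gold is actually gold.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical delivery: 40 gold and 40 silver bars, serial numbers 1..40 each.
bars = [("gold", n) for n in range(1, 41)] + [("silver", n) for n in range(1, 41)]


def lying_label(metal, serial):
    # Scenario 1: ordinary categories, but 20 silver bars are reported as gold.
    # Which 20 get mislabeled is an arbitrary choice made for illustration.
    if metal == "silver" and serial <= 20:
        return "gold"
    return metal


def redefined_label(metal, serial):
    # Scenario 2: "gold" is redefined to mean
    # "gold, or silver with an odd serial number".
    if metal == "gold" or serial % 2 == 1:
        return "gold"
    return "silver"


def receipt(label_fn):
    return Counter(label_fn(metal, serial) for metal, serial in bars)


def p_gold_given_labeled_gold(label_fn):
    labeled_gold = [metal for metal, serial in bars
                    if label_fn(metal, serial) == "gold"]
    return Fraction(labeled_gold.count("gold"), len(labeled_gold))


for fn in (lying_label, redefined_label):
    print(fn.__name__, dict(receipt(fn)), p_gold_given_labeled_gold(fn))
```

The buyer's anticipations conditional on the receipt are identical in both cases; only the story told about the labels differs.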

Minimizing the squared-error score is about map–territory correspondence: ways of communicating that help the factory machines make better predictions about the objects, get a better score.

Minimizing the squared-error-minus-supplier-revenue score is a compromise between map–territory correspondence and saying whatever makes the supplier the most money.

The degree of compromise is quantitative: there’s a continuum of possible scoring functions between “minimize expected squared error, only” (for which the naïve-Bayes categorizer is a good solution), and “maximize supplier revenue, only” (for which “always say BLEGG” is the optimal solution). If always saying whatever profits you and not revealing any information about the territory is deception pure and simple, then the intermediate points on a continuum with that can be thought of as partially deceptive.
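One way to picture the continuum is with a mixing weight. In the sketch below, `lam` is a hypothetical parameter (it does not appear in the scoring function discussed above) that interpolates between pure squared error at `lam = 0` and pure supplier revenue at `lam = 1`. The data and the candidate thresholds are likewise invented; a threshold of 0 means "always say BLEGG".

```python
# Toy illustration only: data, thresholds, and the weight `lam` are hypothetical.
VANADIUM = [1, 2, 3, 7, 8, 9]
THRESHOLDS = [0, 2, 3, 5, 8, 10]     # label BLEGG when feature >= threshold
PRICE = {"BLEGG": 2.0, "RUBE": 1.0}  # dollars per object


def labels_for(threshold):
    return ["BLEGG" if x >= threshold else "RUBE" for x in VANADIUM]


def mse(labels):
    # Receiver predicts each feature value as the mean of its labeled category.
    groups = {}
    for x, label in zip(VANADIUM, labels):
        groups.setdefault(label, []).append(x)
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    return sum((x - means[l]) ** 2 for x, l in zip(VANADIUM, labels)) / len(labels)


def revenue(labels):
    return sum(PRICE[l] for l in labels) / len(labels)


def blended_score(threshold, lam):
    labels = labels_for(threshold)
    return (1 - lam) * mse(labels) - lam * revenue(labels)  # lower is better


def best_threshold(lam):
    return min(THRESHOLDS, key=lambda t: blended_score(t, lam))


for lam in (0.0, 0.5, 0.9, 0.99, 1.0):
    t = best_threshold(lam)
    print(f"lam={lam}: threshold={t}, BLEGG labels={labels_for(t).count('BLEGG')}")
```

At `lam = 0` the best threshold separates the two clusters, and at `lam = 1` the best policy labels everything BLEGG; the values in between are the partially deceptive compromises.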

Depending on your goals, deception can be rational! If you don’t care about other agents having accurate models and just want to intervene on them to make them believe whatever makes them behave in a way that benefits you—or whatever makes them happy—then you can do that! There’s no God to stop you. But in order to help you decide whether deceiving people is the right thing to do, it helps to notice that what you’re doing is deceiving people.


It helps to notice what you’re doing—if you’re trying to be an agent that coherently steers the future in some direction. But who does that, really? Maybe you just want to feel good! And not even coherently steer the universe into configurations where you feel good, either!

Rational agents should want to have true beliefs: the map that reflects the territory, is the map that is useful for navigating the territory. But you don’t—can’t—have unmediated access to the world; you can only infer what the world is like from sensory data, and effectively live in your model of the world. Given the tricky indirection involved, it’s not surprising that poorly-designed agents like humans sometimes get confused and “wirehead” themselves: if you don’t notice the difference, it’s tempting to fabricate a fake map that falsely portrays the territory as being good, instead of making a map that reflects the territory (which you can use to figure out how to improve the territory).

Similarly, if you don’t notice the difference, it’s tempting to choose language that makes the world sound good, rather than to have your language accurately describe the world (which description you can use to figure out how to make the world better).

Suppose I want people to think I’m funny. Funny is a value-laden concept in the specific lawful sense described earlier: non-human agents would have no motive to evaluate the particular fixed computation of humor. It’s also a fuzzy concept: we don’t have a simple test to precisely measure in standard units exactly how funny a joke is, but there’s enough regularity in how people use the word “funny” for the word to be a useful communication signal. It’s also a two-place concept: people have different senses of humor, so that what I consider funny isn’t exactly the same as what you consider funny.

Given all these complications, one could imagine being tempted to think that humor is “subjective”, and that therefore I can define it any way I want, and that therefore, if I feel sad about not being “funny”, I can fix that by changing my definition of the word “funny” such that it includes my jokes. Because definitions can’t be “false”, right!? There’s no rule of rationality prohibiting this boundary-redrawing project—and since I want so desperately to be “funny”, there’s every rule of human decency in favor of it, right?!

So, this obviously doesn’t work. (Okay, it “works” if you deliberately choose to define the word “work” such that it works, but it doesn’t actually work.) Yes requires the possibility of no: redefining X to make “Is it X?” come out true no matter what, loses the purpose of asking the question in the first place. The proposal to redefine the word “funny” came with the purported justification that words don’t have intrinsic meanings, so it can’t be “wrong” to redefine it. But precisely because words don’t have intrinsic meanings, there’s no reason to want to redefine an existing word, except to piggyback off the meaning people are already using that signal for.

(Note that this, in itself, isn’t necessarily deceptive. Sometimes, coining new senses of a word that piggyback off an existing meaning can be a powerful tool for extending our vocabulary to cover new phenomena that we don’t already have words for—as long as we’re careful to specify which meaning is intended when it’s not clear from context.)

It’s not plausible to suppose that I want to be “funny” because I like five-letter words that start with the letter f; I want to be funny because of what that communication signal is already understood to refer to in common usage. The redefinition might (or might not) succeed at making me feel better about myself, but if it does, it only works by means of confusing me: using strategic equivocation to arbitrage the hedonic gap between my new definition, and the old definition (which I still mentally associate with the word).

If it does succeed at making me feel better about myself, is the redefinition “rational”? Happiness is good, right? Should not rationalists win?

I do not frame an answer: that would depend on how you draw the category boundaries of “rational”, which is not an interesting question. (As it is written of a virtue which is nameless: if you speak overmuch of the Way, you will not attain it.)

What I can say, however, is that redefining the concept of humor is not a procedure that uses a map that reflects the territory to systematically achieve goals across a wide range of environments. If there’s anything I can do to become funnier (like practicing telling jokes in a mirror, or studying great comedians to imitate their timing and delivery), I would seem less likely to notice and execute on such a plan after having sabotaged the concept I would need to notice the problem in the first place.


The map is not the territory … but for real agents embedded in the physical universe, the map is part of the territory. This presents some complications to applications of our anti-wireheading moral. We don’t want to wirehead ourselves by making the map look good at the expense of undermining our ability to navigate the territory—but there’s no bright-line distinction demarcating which configurations of atoms are “the map”. From the perspective of the eternal, it’s all just territory.

In the previous post, we considered the case of an assembly line (well, sorting line) worker in the blegg–rube factory being excited about an ostensible promotion to the position of Vice President of Sorting—only to be aggrieved on finding out that it’s a promotion literally in name only, with no changes in pay, authority, or work tasks.

If we interpret the title as part of “the map”, a communication signal with the function of encoding information about the person’s job, then we want to say that the new title is substantively misleading (even if it’s not technically a “lie”): when you hear that someone’s job is being a “Vice President”, you predict that their work involves managing people and making high-level executive decisions for the firm. Your probability that the “Vice President” has to spend all day moving objects from a conveyor belt into one of two bins based on the object’s color and shape (a task that should probably be automated), is lower than before you heard the person’s title: hearing the title made you update in the wrong direction.

But if we interpret the title as part of “the territory”, a feature of the job itself, rather than a communication signal about the job—then it’s not misleading and can’t be misleading. The job happens to be one that has the symbols “Vice President” printed on the accompanying business cards and employee roster, much like how bleggs are objects that happen to be blue. You can’t say the blue is “lying”; that doesn’t make any sense!

The function of words is to serve as signals for communication, so it seems safe to say that language should usually be construed as part of “the map”. Changing names and only names, without altering the things that the names refer to, as in the phony “Vice President” example, is probably deceptive. But for other features associated with a category, it may not always be obvious when we should construe them as “map” rather than “territory”: using a feature to infer category-membership is formally equivalent to regarding it as a signal sent by senders of that category. Is that man pretending to be a doctor, or does he just happen to be wearing a lab coat?

The concept we’re groping towards, and hoping to formulate an elegant reduction of, is that of mimicry. Suppose there is some existing category of entity, an original, typified by some cluster of traits. A mimic is an entity optimized to approximately match the distribution of the original in many, but not all traits, thereby being part of the same cluster as the original in some subspace of the space the original category is defined in, but not the space as a whole. For example, if the vector $\vec{x} = (x_1, x_2, x_3, x_4, x_5)$ is the original, then an optimization process trying to construct a mimic of it in the subspace spanned by $\vec{e}_1$, $\vec{e}_4$, and $\vec{e}_5$ might choose $\vec{x}' = (x_1, y_2, y_3, x_4, x_5)$ with $y_2 \neq x_2$ and $y_3 \neq x_3$: if you only look at the first, fourth, and fifth coordinates, then $\vec{x}$ and $\vec{x}'$ “look the same”—they are the same in that subspace, but not the same if you include the second and third coordinates.
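As a minimal sketch of the same idea in code (the particular numbers are invented for illustration), the following checks that a mimic agrees with an original on the first, fourth, and fifth coordinates while differing from it in the full space.

```python
# Hypothetical vectors, chosen only for illustration.
original = [4, 1, 7, 3, 9]
mimic = [4, 6, 2, 3, 9]

# Zero-based indices of the first, fourth, and fifth coordinates.
MATCHED_COORDS = (0, 3, 4)


def project(vector, coords):
    """Keep only the listed coordinates (projection onto that subspace)."""
    return [vector[i] for i in coords]


same_in_subspace = project(original, MATCHED_COORDS) == project(mimic, MATCHED_COORDS)
same_in_full_space = original == mimic

print(same_in_subspace, same_in_full_space)
```

An observer who can only measure the matched coordinates has no way to tell the two vectors apart.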

We can find examples in nature. Suppose one type of butterfly has evolved to be toxic to a type of predator, and also has distinctive wing markings that function as an honest warning signal to that predator: this butterfly is not good to eat. This provides an “opportunity” (in evolutionary time) for a second species of butterfly to develop similar wing markings, so that predators will confuse it for the first type of butterfly, despite the second butterfly not paying the metabolic cost of producing toxins. This kind of situation is called Batesian mimicry.

Is Batesian mimicry deceptive? (In our usual functionalist sense, which is obviously not a claim about butterfly psychology.) Is the second butterfly’s very existence a kind of lie?

In some sense, yes! The mimic butterfly has been optimized by evolution to look like the first butterfly because of the fitness payoff of being categorized by the predator as the first, toxic, kind of butterfly. The “categorized by the predator as toxic” category is a natural, compact region in wing-marking-space, but “comes apart” into two clusters in the broader wing-markings–actual-toxicity space.

Furthermore, the evolutionary dynamics create an asymmetric relationship between the two categories, that isn’t captured by just the two trait-clusters themselves. The reason for the mimic butterfly to have those particular wing-markings is in order to manipulate the predator’s predictions of toxicity (which were learned from encounters with the original), so if the original’s wing-markings were to change as a result of some new selection pressure, the mimic would be subjected to selection pressure to “keep up” by changing its wing-markings accordingly.

That’s not true in the other direction: if the mimic’s markings were to change, the original wouldn’t “follow”: the original would instead benefit from the probabilistic strength of its warning signal not being parasitically diluted by the mimic anymore. Thus, the asymmetric terminology of “original” and “mimic” is appropriate: it’s not just that these two species happen to look like each other; one of them was there first, and the other looks like it.

Is mimicry always deceptive? Not necessarily—there might be some situations where the relevant set of variables are among those where the mimic matches the distribution of the original.

Suppose you and I are feeding some ducks in the park. I say, “I love feeding these ducks!”

You say, “Wrong! These aren’t all ducks. This park is where a local inventor tests out his Anatidoid robots that are designed to look and act like ducks. Therefore, you can’t say, ‘I love feeding these ducks’; you need to say ‘I love feeding these ducks and Anatidoid robots’.”

“Wow, they’re so realistic!” I say. “I can’t even tell which ones are really robots! In fact,” I continue, “since I can’t tell, I’m inclined to just keep calling them all ducks; it would be pretty awkward to refer to each one as a duck-or-Anatidoid-robot.”

“But it is possible to tell,” you claim. “For example, if you get really close to one of the Anatidoid robots, and there’s not a lot of ambient noise, you can hear the gears inside, turning.”

“Okay,” I say, “but I can’t hear the gears from here. Since I have no way of telling the difference between ducks and Anatidoid robots without doing the more expensive evidence-gathering of cornering one in a quiet place, it makes sense for me to talk and think about the robots as being a kind of duck.”

“But that’s a lie! Ducks and Anatidoid robots may look and act similarly, but they’re actually very different! Ducks are made of flesh and blood inside and are fated to die, whereas Anatidoid robots have a plastic interior and are immortal. And the ducks digest and gain nutrients from the scraps of bread we’re feeding them, whereas the Anatidoid robots merely store the bread in an internal compartment that later gets dumped as they recharge wirelessly in the inventor’s lab.”

“Sure,” I agree. “And if I were interacting with these entities in a context where I wanted to minimize the expected squared error of my predictions about their internal makeup, energy sources, or ultimate fate, then I would want to make that distinction. But I just want to watch some cool ducks in the park, and in the context of that activity, I only need to minimize the expected squared error of my predictions about appearance and behavior.”

This is the origin of the famous duck test: if it looks like a duck, and quacks like a duck, and you can model it as a duck without making any grievous prediction errors, then it makes sense to consider it a member of the category duck in the range of circumstances where your model continues to perform well.

The features for which mimics fail to match the original need not be hidden (like gear sounds that you can’t hear in a noisy park) in order for mimics to not be deceptive; they only need to be irrelevant in the context the category is being used. Squirt guns aren’t guns—and are usually manufactured in unrealistic colors specifically to prevent being confused with real guns—but in the context of a water fight, the utterance “Don’t point that gun at me” (without the privative adjective squirt) is understood perfectly well.

Nondeceptive mimicry is fragile, however: it only works in contexts where all the relevant features are ones where the mimic matches the original. Mimics that don’t match the distribution of the original along relevant features are deceptive in the sense that agents that observe the mimic and assign it to the same mental category as the original on the basis of the matching features, will use that categorization to make predictions about unobserved but nonmatching features, and be wrong. And they’ll be wrong because the mimic is optimized to “look like” the original (to match on many observable features).
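The context-dependence can be sketched in code. The feature names, values, and the two contexts below are hypothetical. An observer who files a robot under "duck" predicts the typical duck value for every feature; whether that causes prediction errors depends on which features the observer's context makes relevant.

```python
# Hypothetical features; 1.0 means "has the trait", 0.0 means "lacks it".
TYPICAL_DUCK = {"looks_like_duck": 1.0, "quacks": 1.0,
                "digests_bread": 1.0, "mortal": 1.0}
ANATIDOID_ROBOT = {"looks_like_duck": 1.0, "quacks": 1.0,
                   "digests_bread": 0.0, "mortal": 0.0}


def squared_error(predicted, actual, relevant_features):
    """Total squared prediction error, counting only the relevant features."""
    return sum((predicted[f] - actual[f]) ** 2 for f in relevant_features)


park_context = ["looks_like_duck", "quacks"]      # watching ducks in the park
veterinary_context = ["digests_bread", "mortal"]  # caring about what is inside

# Treating the robot as a duck means predicting TYPICAL_DUCK's values for it.
error_in_park = squared_error(TYPICAL_DUCK, ANATIDOID_ROBOT, park_context)
error_at_vet = squared_error(TYPICAL_DUCK, ANATIDOID_ROBOT, veterinary_context)

print(error_in_park, error_at_vet)
```

The same categorization costs nothing in the first context and is maximally wrong on every relevant feature in the second.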


If different agents using a shared language disagree on what features are “relevant”, they may have an incentive to fight about how scarce and valuable short codewords should be defined in their common language, in order to exert control over what inferences and decisions agents using that language can easily make and coordinate on.

Let’s consider how this might apply to a real-world issue. From moral perspectives that place a lot of value on the welfare of nonhuman animals, factory farming is an ongoing moral catastrophe. Unfortunately (for the farmed animals), meat-eaters and the global agriculture industry they support aren’t going to change their ways because of anyone’s desperate cry at the horror of suffering or carefully-reasoned appeal to the global utilitarian calculus. Animal-rights advocates can sway behavior on the margin, but there’s just too much biological and cultural inertia favoring the consumption of animal products for it to be feasible to outlaw factory farming the way chattel slavery was outlawed. It’s not that humans hate farm animals; they’re just … made out of tissue that we can use for other things.

An alternative strategy for ending factory farming is to prioritize the development of artificial substitutes that mimic real meat, eggs, dairy, &c. along the consumption-relevant dimensions of taste, texture, nutrition, &c., but are produced in a lab or factory rather than from the tissues of sentient creatures. In the limit of arbitrarily capable physical manufacturing technology, carnivores and factory-farming opponents alike could be satisfied: if two steaks are indistinguishable by any physical means whatsoever, then a meat-eater has no reason to care which one came from an actual cow’s flesh, and which one was molecularly assembled by nanobots. Perhaps a society of hunter–gatherers that attached cultural significance and ritual to the labor of killing one’s own meal would have a reason to object, but modern folk for whom food comes from the supermarket have no basis within their experience to say that the nanoassembled steak isn’t “real”.

Unfortunately, we do not have arbitrarily capable physical manufacturing technology. Although progress continues, modern animal product substitutes are sufficiently unsuccessful mimics that they are usually not considered to belong to “the same” category as the original. Veggie burgers are not burgers in the sense that a customer who ordered “a burger” at a restaurant and was served a veggie burger would be likely to notice and complain—and in particular, would probably not be satisfied if the waiter were to reply, “Well, if you specifically wanted a burger made from cow flesh, you should have said that.”

As technology to make plausible mimics/substitutes improves, however, different interest groups might face a temptation to fight over the meanings of words that was not present when the mimics weren’t plausible enough for a dispute to arise. If you have the power of setting the default extension of a word that people are already using to communicate with, you can exert some amount of control over the decisions people make while trying to think using that word. Should the meaning change, then a restaurant customer who wants to make sure they receive a burger under the old definition now has to use more words, while those who don’t have a strong preference or are too shy to complain will accept the restaurant’s interpretation of the order.

Thus, if a fight breaks out about the meaning of the word meat, animal rights activists have a moral incentive to draw the category “boundaries” to include even substitutes that are very bad (on the empirical merits of successfully mimicking the original), whereas existing agricultural interests have a financial incentive to draw the “boundaries” to exclude even substitutes that are very good. (This kind of dispute is not hypothetical, and isn’t necessarily limited to just words: in the late 19th century, dairy farmers pushed for laws that required margarine to be dyed pink to prevent consumers from confusing it for butter—the law effectively interpreting color as a communication signal, rather than a property of the good itself.)

If a fight breaks out about the meaning of the word meat, rationalists may not all take the same side, but we can at least strive for objectivity in describing the conflict—and in particular, to notice the difference between definitions motivated by describing reality, and definitions motivated by the positive or negative effects (such as profitably deceiving other agents) of choosing one description or another.

If some think that some meat substitute should be considered meat because the “taste” dimension is genuinely most relevant to the true meaning of meat, and some oddities in the texture don’t matter, but others think vice versa, the philosophy articulated in this post has nothing to say to either side: the math of minimizing expected squared error by putting labels on clusters doesn’t say which subspace to look for clusters in.

But if some think that some meat substitute should be considered meat because saving nonhuman animals from a life of torture is more important than conceptual parsimony … I can’t prove that that’s not the right answer to the decision problem of what verbal behavior to perform. The stakes are genuinely high.

What I can say is that the hidden Bayesian structure of language and cognition makes no reference to the stakes, and departing from the structure extracts a price that isn’t up to us.

If, empirically, being generous about what counts as “meat” can prevent massive suffering (by altering the social defaults around consumption behavior), then maybe that’s the right thing to do.

Similarly, if telling the public that masks don’t work for preventing respiratory disease can preserve supplies for medical professionals who need them more, then maybe that’s the right thing to do.

And if you live in an absurd thought experiment where saying “2 + 2 = 5” could save 3↑↑↑3 lives, maybe saying “2 + 2 = 5” is the right thing to do. But the empirical question of whether you happen to live in that particular thought experiment, doesn’t change the laws that govern what you have when you take ●●-many plus another ●●-many, no matter what symbols are used to communicate this fact, and no matter the consequences for communicating it.


For these reasons it is written of the third virtue of lightness: you cannot make a true map of the category by drawing lines upon paper according to impulse; you must observe the joint distribution and draw lines on paper that correspond to what you see. If, seeing the category unclearly, you think that you can shift a boundary just a little to the right, just a little to the left, according to your caprice, this is just the same mistake.

And as it is written of a virtue which is nameless: perhaps your conception of rationality is that it is rational to believe the words of the Great Teacher, who lives in an area where claiming that the sky is blue would be political suicide.

And the Great Teacher says, “Some people I usually respect for their willingness to publicly die on a hill of facts, now seem to be talking as if color references are necessarily a factual statement about frequencies of light. But using language in a way you dislike, is not lying. You’re not standing in defense of Truth if you insist on a word, brought explicitly into question, being used with some particular meaning.” And you look up at the sky and see blue.

If you think: “It may look like the sky is blue, such that I’d ordinarily think that someone who said ‘The sky is green’ was being deceptive, but surely the Great Teacher wouldn’t egregiously mislead people about the philosophy of language when being egregiously misleading happens to be politically convenient,” you lose a chance to discover your mistake.

How will you discover your mistake? Not by comparing your description to itself.

But by comparing it to that which you did not name.

(Thanks to Jessica Taylor, Abram Demski, and Tsvi Benson-Tilsen for discussion and feedback.)

