Algorithmic Intent: A Hansonian Generalized Anti-Zombie Principle
“Why didn’t you tell him the truth? Were you afraid?”
“I’m not afraid. I chose not to tell him, because I anticipated negative consequences if I did so.”
“What do you think ‘fear’ is, exactly?”
The Generalized Anti-Zombie Principle calls for us to posit “consciousness” as causally upstream of reports of phenomenological experience (even if the causal link might be complicated and we might be wrong about the details of what consciousness is). If you’re already familiar with conscious humans, then maybe you can specifically engineer a non-conscious chatbot that imitates the surface behaviors of humans talking about their experiences, but you can’t have a zombie that just happens to talk about being conscious for no reason.
A similar philosophical methodology may help us understand other mental phenomena that we cannot perceive directly, but infer from behavior. The Hansonian Generalized Anti-Zombie Principle calls for us to posit “intent” as causally upstream of optimized behavior (even if the causal link might be complicated and we might be wrong about the details of what intent is). You can’t have a zombie that just happens to systematically select actions that result in outcomes that rank high with respect to a recognizable preference ordering for no reason.
It’s tempting to think that consciousness isn’t part of the physical universe. Seemingly, we can imagine a world physically identical to our own—the same atom-configurations evolving under the same laws of physics—but with no consciousness, a world inhabited by philosophical “zombies” who move and talk, but only as mere automatons, without the spark of mind within.
It can’t actually work that way. When we talk about consciousness, we do so with our merely physical lips or merely physical keyboards. The causal explanation for talk about consciousness has to either exist entirely within physics (in which case anything we say about consciousness is causally unrelated to consciousness, which is absurd), or there needs to be some place where the laws of physics are violated as the immaterial soul is observed to be “tugging” on the brain (which is in-principle experimentally detectable). Zombies can’t exist.
But if consciousness exists within physics, it should respect a certain “locality”: if the configuration-of-matter that is you, is conscious, then almost-identical configurations should also be conscious for almost the same reasons. An artificial neuron that implements the same input-output relationships as a biological one, would “play the same role” within the brain, which would continue to compute the same externally-observable behavior.
We don’t want to say that only externally-observable behavior matters and internal mechanisms don’t matter at all, because substantively different internal mechanisms could compute the same behavior. Prosaically, acting exists: even the best method actors aren’t really occupying the same mental state that the characters they portray would be in. In the limit, we could (pretend that we could) imagine an incomprehensibly vast Giant Lookup Table that has stored the outputs that a conscious mind would have produced in response to any input. Is such a Giant Lookup Table—an entirely static mapping of inputs to outputs—conscious? Really?
But this thought experiment requires us to posit the existence of a Giant Lookup Table that just happens to mimic the behavior of a conscious mind. Why would that happen? Why would that actually happen, in the real world? (Or the closest possible world large enough to contain the Giant Lookup Table.) “Just assume it happened by coincidence, for the sake of the thought experiment” is unsatisfying, because that kind of arbitrary miracle doesn’t help us understand what kind of cognitive work the ordinary simple concept of consciousness is doing for us. You can assume that a broken and scrambled egg will spontaneously reassemble itself for the sake of a thought experiment, but the interpretation of your thought-experimental results may seem tendentious given that we have Godlike confidence that you will never, ever see that happen in the real world.
The hard problem of consciousness is still confusing unto me—it seems impossible that any arrangement of mere matter could add up to the ineffable qualia of subjective experience. But the easier and yet clearly somehow related problem of how mere matter can do information-processing—can do things like construct “models” by using sensory data to correlate its internal state with the state of the world—seems understandable, and a lot of our ordinary use of the concept of consciousness necessarily deals with the “easy” problems, like how perception works or how to interpret people’s self-reports, even if we can’t see the identity between the hard problem and the sum of all the easy problems. Whatever the true referent of “consciousness” is—however confused our current concept of it may be—it’s going to be, among other things, the cause of our thinking that we have “consciousness.”
If I were to punch you in the face, I could anticipate the experience of you reacting somehow—perhaps by saying, “Ow, that really hurt! I’m perceiving an ontologically-basic quale of pain right now! I hereby commit to extract a costly revenge on you if you do that again, even at disproportionate cost to myself!” The fact that the human brain has the detailed functional structure to compute that kind of response, whereas rocks and trees don’t, is why we can be confident that rocks and trees don’t secretly have minds like ours.
We recognize consciousness by its effects because we can only recognize anything by its effects. For a much simpler example, consider the idea of sorting. Human alphabets aren’t just a set of symbols—we also have a concept of the alphabet coming in some canonical order. The order of the alphabet doesn’t play any role in the written language itself: you wouldn’t have trouble reading books from an alternate world where the order of the Roman alphabet ran KUWONSEZYFIJTABHQGPLCMVDXR, but all English words were the same—but you would have trouble finding the books on a shelf that wasn’t sorted in the order you’re used to. Sorting is useful because it lets us find things more easily: “The title I’m looking for starts with a P, but the book in front of me starts with a B; skip ahead” is faster than “look at every book until you find the one”.
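To make the bookshelf example concrete, here is a minimal sketch in Python. Only the alternate alphabet order comes from the text; the 676 two-letter “titles” and the helper names (`alt_key`, `linear_find`, `binary_find`) are made up for illustration, and halving the search range is just one way of cashing out “skip ahead”.

```python
from string import ascii_uppercase

# Alternate-world alphabet order quoted in the text.
ALT_ORDER = "KUWONSEZYFIJTABHQGPLCMVDXR"
RANK = {letter: i for i, letter in enumerate(ALT_ORDER)}


def alt_key(title):
    """Rank each letter by its position in the alternate alphabet."""
    return [RANK[ch] for ch in title.upper() if ch in RANK]


def linear_find(shelf, title):
    """Look at every book until you find the one. Returns (index, books examined)."""
    for i, book in enumerate(shelf):
        if book == title:
            return i, i + 1
    return -1, len(shelf)


def binary_find(shelf, title, key):
    """Skip ahead or back on a shelf sorted by `key`. Returns (index, books examined)."""
    target = key(title)
    lo, hi, examined = 0, len(shelf), 0
    while lo < hi:
        mid = (lo + hi) // 2
        examined += 1
        if shelf[mid] == title:
            return mid, examined
        if key(shelf[mid]) < target:
            lo = mid + 1  # the title we want is further along the shelf
        else:
            hi = mid      # the title we want is earlier on the shelf
    return -1, examined


# Hypothetical shelf: 676 two-letter titles, shelved in the alternate-world order.
shelf = sorted((a + b for a in ascii_uppercase for b in ascii_uppercase), key=alt_key)

print(shelf[:3])                          # the shelf starts with K titles, not A titles
print(linear_find(shelf, "RR"))           # "RR" is last, so every book gets examined
print(binary_find(shelf, "RR", alt_key))  # only a handful of books get examined
```

The words on each spine are unchanged; only the shelving order differs, and the search only works if the searcher uses the same order the shelf was sorted in.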
In the days before computers, the work of sorting was always done by humans: if you want your physical bookshelf to be alphabetized, you probably don’t have a lot of other options than manually handling the books yourself (“This title starts with a Pl; I should put it … da da da here, after this title starting with Pe but before its neighbor starting with Po”). But the computational work of sorting is simple enough that we can program computers to do it and prove theorems about what is being accomplished, without getting confused about the sacred mystery of sorting-ness.
Very different systems can perform the work of sorting, but whether it’s a human tidying her bookshelf, or a punchcard-sorting machine, or a modern computer sorting in RAM, it’s useful to have a short word to describe processes that “take in” some list of elements, and “output” a list with the same elements ordered with respect to some criterion, for which we can know that the theorems we prove about sorting-in-general will apply to any system that implements sorting. (For example, sorting processes that can only compare two items to check which is “greater” (as opposed to being able to exploit more detailed prior information about the distribution of elements) can expect to have to perform Θ(n log n) comparisons, where n is the length of the list.)
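As a rough check on the comparison-counting claim, the sketch below instruments a textbook top-down merge sort and sets its comparison count next to the information-theoretic figure log2(n!), which grows like n log n. The list size, the seed, and the function names are arbitrary choices of mine. Note that log2(n!) bounds the worst case, so a particular lucky input can come in under it.

```python
import math
import random


def merge_sort(items, counter):
    """Top-down merge sort; counter[0] accumulates the number of comparisons made."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one "which is greater?" check
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


n = 1000
data = random.Random(0).sample(range(10 * n), n)  # arbitrary shuffled input
counter = [0]
result = merge_sort(data, counter)

# log2(n!) = sum of log2(k): the worst-case minimum for any comparison sort.
lower_bound = sum(math.log2(k) for k in range(2, n + 1))

print("comparisons made:", counter[0])
print("log2(n!):        ", round(lower_bound))
print("n * log2(n):     ", round(n * math.log2(n)))
```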
Someone who wasn’t familiar with computers might refuse to recognize sorting algorithms as real sorting, as opposed to mere “artificial sorting”. After all, a human sorting her bookshelf intends to put the books in order, whereas the computer is just an automaton following instructions, and doesn’t intend anything at all—a zombie sorter!
But this position is kind of silly, a gerrymandered concept definition. To be sure, it’s true that the internal workings of the human are very different from that of the computer. The human wasn’t special-purpose programmed to sort and is necessarily doing a lot more things. The whole modality of visual perception, whereby photons bouncing off a physical copy of Rationality: AI to Zombies and absorbed by the human’s retina are interpreted as evidence to construct a mental representation of the book in physical reality, whose “title” “begins” with an “R”, is much more complicated than just storing the bit-pattern 1010010 (the ASCII code for R) in RAM. Nor does the computer have the subjective experience of eagerly looking forward to how much easier it will be to find books after the bookshelf is sorted. The human also probably won’t perform the exact same sequence of comparisons as a computer program implementing quicksort—which also won’t perform the same sequence of comparisons as a different program implementing merge sort. But the comparisons—the act of taking two things and placing them somewhere that depends on which one is “greater”—need to happen in order to get the right answer.
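A small sketch of that last point: the two functions below are simplified textbook versions of quicksort and merge sort (not tuned implementations), each logging the pairs it compares. The book titles other than the one named above are placeholders. The logs differ, but the output is the same.

```python
def quicksort(items, log):
    """First-element-pivot quicksort; appends each compared pair to `log`."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller, larger = [], []
    for x in rest:
        log.append((x, pivot))
        (smaller if x < pivot else larger).append(x)
    return quicksort(smaller, log) + [pivot] + quicksort(larger, log)


def merge_sort(items, log):
    """Top-down merge sort; appends each compared pair to `log`."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid], log), merge_sort(items[mid:], log)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        log.append((left[i], right[j]))
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    return merged + left[i:] + right[j:]


# Placeholder shelf; only the first title is taken from the text.
books = ["Rationality: AI to Zombies", "Persuasion", "Beloved", "Kindred", "Emma"]
quick_log, merge_log = [], []
by_quicksort = quicksort(books, quick_log)
by_merge_sort = merge_sort(books, merge_log)

print(by_quicksort == by_merge_sort)   # same answer
print(quick_log == merge_log)          # different sequence of comparisons
print(len(quick_log), len(merge_log))  # and a different number of them

# The bit-pattern mentioned in the text: the ASCII code for "R" in binary.
print(format(ord("R"), "b"))
```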
The concept of “sorting into alphabetical order” may have been invented before our concept of “computers”, but the most natural concept of sorting includes computers performing quicksort, merge sort, &c., despite the lack of intent. We might say that intent is epiphenomenal with respect to sorting.
But even if we can understand sorting without understanding intent, intent isn’t epiphenomenal to the universe. Intent is part of the fabric of stuff that makes stuff happen: there are sensory experiences that will cause you to usefully attribute intent to some physical systems and not others.
Specifically, whatever “intent” is—however confused our current concept of it may be—it’s going to be, among other things, the cause of optimized behavior. We can think of something as an optimization process if it’s easier to predict its effects on the world by attributing goals to it, rather than by simulating its detailed actions and internal state. “To figure out a strange plot, look at what happens, then ask who benefits.”
Alex Flint identifies robustness to perturbations as another feature of optimizing systems. If you scrambled the books on the shelf while the human was taking a bathroom break away from sorting, when she came back she would notice the rearranged books, and sort them again—that’s because she intends to achieve the outcome of the shelf being sorted. Sorting algorithms don’t, in general, have this property: if you shuffle a subarray in memory that the operation of the algorithm assumes has already been sorted, there’s nothing in the code to notice or care that the “intended” output was not achieved.
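The contrast can be sketched in a few lines of Python. This is a toy model under my own assumptions: insertion sort stands in for “a sorting algorithm”, the perturbation is a swap inside the already-sorted prefix, and the “goal-directed” sorter is simply a loop that checks the outcome and tries again. None of the names come from Flint’s post.

```python
def insertion_sort_steps(shelf):
    """In-place insertion sort that pauses (yields) after placing each item.
    It assumes shelf[:i] is already sorted and never re-checks that assumption."""
    for i in range(1, len(shelf)):
        j = i
        while j > 0 and shelf[j - 1] > shelf[j]:
            shelf[j - 1], shelf[j] = shelf[j], shelf[j - 1]
            j -= 1
        yield i


def is_sorted(shelf):
    return all(a <= b for a, b in zip(shelf, shelf[1:]))


def scramble(shelf):
    """Hypothetical perturbation: swap two books in the already-sorted prefix."""
    shelf[0], shelf[2] = shelf[2], shelf[0]


def plain_algorithm(shelf, perturb_at):
    """Run the algorithm once; a perturbation happens during a pause."""
    for i in insertion_sort_steps(shelf):
        if i == perturb_at:
            scramble(shelf)
    return shelf


def goal_directed(shelf, perturb_at):
    """Same perturbation, but the sorter checks the result and re-sorts until done."""
    perturbed = False
    while True:
        for i in insertion_sort_steps(shelf):
            if i == perturb_at and not perturbed:
                scramble(shelf)
                perturbed = True
        if is_sorted(shelf):
            return shelf


print(plain_algorithm([3, 1, 2, 6, 5, 4], perturb_at=3))  # prefix stays scrambled
print(goal_directed([3, 1, 2, 6, 5, 4], perturb_at=3))    # ends up sorted anyway
```

The only difference between the two functions is the outer loop that compares the world to the goal, which is the part that makes the second one robust.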
Note that this is a “behaviorist”, “third person” perspective: we’re not talking about some subjective feeling of intending something, just systems that systematically steer reality into otherwise-improbable states that rank high with respect to some preference ordering.
Robin Hanson often writes about hidden motives in everyday life, advancing the thesis that the criteria that control our decisions aren’t the same as the high-minded story we tell other people, and even the story we represent to ourselves. If you take a strictly first-person perspective on intent, the very idea of hidden motives seems absurd—a contradiction in terms. What would it even mean, to intend something without being aware of it? How would you identify an alleged hidden motive?
The answer is that positing hidden motives can simplify our predictions of behavior. It can be easier to “look backwards” from what goals the behavior achieves, and continues to achieve in the presence of novel obstacles, than to “look forwards” from a detailed model of the underlying psychological mechanisms (which are typically unknown).
Hanson and coauthor Kevin Simler discuss the example of nonhuman primates grooming each other—manually combing each other’s fur to remove dirt and parasites. One might assume that the function of grooming is just what it appears to be: hygiene. But that doesn’t explain why primates spend more time grooming than they need to, why they predominately groom others rather than themselves, and why the amount of time a species spends grooming is unrelated to the amount of hair it has to groom, but is related to the size of social groupings. These anomalies make more sense if we posit that grooming has been optimized for social-political functions, to provide a credible signal of trust. (The signal has to cost something—in this case, time—in order for it to not be profitable to fake.) The hygienic function of grooming isn’t unreal—parasites do in fact get removed—but the world looks more confusing if you assume the behavior is optimized solely for hygiene.
This kind of multiplicity of purposes is ubiquitous. Thus, nobody does the thing they are supposedly doing: politics isn’t about policy, school is not about learning, medicine is not about health, &c.
There are functional reasons for some of the purposes of social behavior to be covert, to conceal or misrepresent information that it wouldn’t be profitable for others to know. (And covert motivations might be a more effective design from an evolutionary perspective than outright lying if it’s too expensive to maintain two mental representations: the real map for ourselves, and a fake map for our victims.) This is sometimes explained as, “We self-deceive in order to better deceive others,” but I fear that this formulation might suggest more “central planning” on the cognitive side of the evolutionary–cognitive boundary than is really necessary: “self-deception” can arise from different parts of the mind working at cross-purposes.
Ziz discusses the example of a father attempting to practice nonviolent communication with his unruly teenage son: the father wants to have an honest and peaceful discussion of feelings and needs, but is afraid he’ll lose control and become angry and threatening.
But angry threats aren’t just a random mistake, in the way it’s a random mistake if I forget to carry the one while adding 143 + 28. Random mistakes don’t serve a purpose and don’t resist correction: there’s no plausible reason for me to want the incorrect answer 143 + 28 = 161, and if you say, “Hey, you forgot to carry the one,” I’ll almost certainly just say “Oops” and get it right the second time. Even if I’m more likely to make arithmetic errors when I’m tired, the errors probably won’t correlate in a way that steers the future in a particular direction: you can’t use information about what I want to make better predictions about what specific errors I’ll make, nor use observations of specific errors to infer what I want.
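For the record, the arithmetic slip can be reproduced mechanically. The sketch below does schoolbook column addition with an optional “forget to carry” switch; the function name and the flag are mine.

```python
def column_add(a, b, carry_the_one=True):
    """Schoolbook addition of non-negative integers, one decimal column at a time."""
    result, carry, place = 0, 0, 1
    while a > 0 or b > 0 or carry:
        column = a % 10 + b % 10 + carry
        result += (column % 10) * place
        # Forgetting to carry just drops the overflow from this column.
        carry = column // 10 if carry_the_one else 0
        a, b, place = a // 10, b // 10, place * 10
    return result


print(column_add(143, 28))                       # the right answer
print(column_add(143, 28, carry_the_one=False))  # the "Oops" answer
```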
In contrast, the father is likely to “lose control” and make angry threats precisely when peaceful behavior isn’t getting him what he wants. That’s what anger is designed to do: threaten to impose costs or withhold benefits to induce conspecifics to place more weight on the angry individual’s welfare.
Another example of hidden motives: Less Wrong commenter Caravelle tells a story about finding a loophole in an online game, and being outraged to later be accused of cheating by the game administrators—only in retrospect remembering that, on first discovering the loophole, they had specifically told their teammates not to tell the administrators. The earlier Caravelle-who-discovered-the-bug must have known that the admins wouldn’t allow it (or else why instruct teammates to keep quiet about it?), but the later Caravelle-who-exploited-the-bug was able to protest with perfect sincerity that they couldn’t have known.
Another example: someone asks me an innocuous-as-far-as-they-know question that I don’t feel like answering. Maybe we’re making a cake, and I feel self-conscious about my lack of baking experience. You ask, “Why did you just add an eighth-cup of vanilla?” I initially mishear you as having said, “Did you just add …” and reply, “Yes.” It’s only a moment later that I realize that that’s not what you asked: you said “Why did you …”, not “Did you …”. But I don’t correct myself, and you don’t press the point. I am not a cognitive scientist and I don’t know what was really going on in my brain when I misheard you: maybe my audio processing is just slow. But it seems awfully convenient for me that I momentarily misheard your question specifically when I didn’t want to answer it and thereby reveal that I don’t know what I’m doing—almost as if the elephant in my brain bet that it could get away with pretending to mishear you, and the bet paid off.
Our existing language may lack the vocabulary to adequately describe optimized behavior that comes from a mixture of overt and hidden motives. Does the father intend to make angry threats? Did the gamer intend to cheat? Was I only pretending to mishear your question, rather than actually mishearing it? We want to say No—not in the same sense that someone consciously intends to sort her bookshelf. And yet it seems useful to have short codewords to talk about the aspects of these behaviors that seem optimized. The Hansonian Generalized Anti-Zombie Principle says that when someone “loses control” and makes angry threats, it’s not because they’re a zombie that coincidentally happens to do so when being nice isn’t getting them what they want.
As Jessica Taylor explains, when our existing language lacks the vocabulary to accommodate our expanded ontology in the wake of a new discovery, one strategy for adapting our language is to define new senses of existing words that metaphorically extend the original meaning. The statement “Ice is a form of water” might be new information to a child or a primitive AI who has already seen (liquid) water, and already seen ice, but didn’t know that the former turns into the latter when sufficiently cold.
The word water in the sentence “Ice is a form of water” has a different extensional meaning than the word water in the sentence “Water is a liquid”, but both definitions can coexist as long as we’re careful to precisely disambiguate which sense of the word is meant in contexts where equivocation could be deceptive.
We might wish to apply a similar linguistic tactic in order to be able to concisely talk about cases where we think someone’s behavior is optimized to achieve goals, but the computation that determines the behavior isn’t necessarily overt or conscious.
Algorithmic seems like a promising candidate for a disambiguating adjective to make it clear that we’re talking about the optimization criteria implied by a system’s inputs and outputs, rather than what it subjectively feels like to be that system. We could then speak of an “algorithmic intent” that doesn’t necessarily imply “(conscious) intent”, similarly to how ice is a form of “water” despite not being “(liquid) water”. We might similarly want to speak of algorithmic “honesty” (referring to signals selected on the criterion of making receivers have more accurate beliefs), “deception” (referring to signals selected for producing less accurate beliefs), or even “fraud” (deception that moves resources to the agent sending the deceptive signal).
Some authors might admit the pragmatic usefulness of the metaphorical extension, but insist that the new usage be marked as “just a metaphor” with a prefix such as pseudo- or quasi-. But I claim that broad “algorithmic” senses of “mental” words like intent often are more relevant and useful for making sense of the world than the original, narrower definitions that were invented by humans in the context of dealing with other humans, because the universe in fact does not revolve around humans.
When a predatory Photuris firefly sends the mating signal of a different species of firefly in order to lure prey, I think it makes sense to straight-up call this deceptive (rather than merely pseudo- or quasi-deceptive), even though fireflies don’t have language with which to think the verbal thought, “And now I’m going to send another species’s mating signal in order to lure prey …”
When a generative adversarial network learns to produce images of realistic human faces or anime characters, it would in no way aid our understanding to insist that the system isn’t really “learning” just because it’s not a human learning the way a human would—any more than it would to insist that quicksort isn’t really sorting. “Using exposure to data as an input into gaining capabilities” is a perfectly adequate definition of learning in this context.
In a nearby possible future, when you sue a company for fraud because their advertising claimed that their product would disinfect wolf bites, but the product instead gave you cancer, we would hope that the court will not be persuaded if the company’s defense-lawyer AI says, “But that advertisement was composed by filtering GPT-5 output for the version that increased sales the most—at no point did any human form the conscious intent to deceive you!”
Another possible concern with this proposed language usage is that if it’s socially permissible to attribute unconscious motives to interlocutors, people will abuse this to selectively accuse their rivals of bad intent, leading to toxic social outcomes: there’s no way for negatively-valenced intent-language like “fraud” or “deception” to stably have denotative meanings independently of questions of who should be punished.
It seems plausible to me that this concern is correct: in a human community of any appreciable size, if you let people question the stories we tell about ourselves, you are going to get acrimonious and not-readily-falsifiable accusations of bad intent. (“Liar!” “Huh? You can argue that I’m wrong, but I actually believe what I’m saying!” “Oh, maybe consciously, but I was accusing you of being an algorithmic liar.”)
Unfortunately, as an aspiring epistemic rationalist, I’m not allowed to care whether some descriptions might be socially harmful for a human community to adopt; I’m only allowed to care about what descriptions shorten the length of the message needed to describe my observations.
Robin Hanson and Kevin Simler, The Elephant in the Brain: Hidden Motives in Everyday Life, Ch. 1, “Animal Behavior”