Qualitative Strategies of Friendliness

Followup to: Magical Categories

What on Earth could someone possibly be thinking, when they propose creating a superintelligence whose behaviors are reinforced by human smiles? Tiny molecular photographs of human smiles—or if you rule that out, then faces ripped off and permanently wired into smiles—or if you rule that out, then brains stimulated into permanent maximum happiness, in whichever way results in the widest smiles...

Well, you never do know what other people are thinking, but in this case I’m willing to make a guess. It has to do with a field of cognitive psychology called Qualitative Reasoning.

Qualitative reasoning is what you use to decide that increasing the temperature of your burner increases the rate at which your water boils, which decreases the derivative of the amount of water present. One would also add the sign of d(water), which is negative, meaning that the amount of water is decreasing, and perhaps the fact that there is only a bounded amount of water. Or we could say that turning up the burner increases the rate at which the water temperature increases, until the water temperature goes over a fixed threshold, at which point the water starts boiling, and hence decreasing in quantity… etc.

That’s qualitative reasoning, a small subfield of cognitive science and Artificial Intelligence—reasoning that doesn’t describe or predict exact quantities, but rather the signs of quantities, their derivatives, the existence of thresholds.
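As a rough illustration of the burner example, here is a minimal sketch in Python that tracks only signs and a threshold, never magnitudes. The names `influence` and `qualitative_boil` are invented for this sketch and do not come from any qualitative reasoning library.

```python
# Hypothetical sketch: reason about the signs of changes only, never magnitudes.
POS, ZERO, NEG = 1, 0, -1

def influence(change, link):
    """Propagate the sign of a change across a causal link.

    `link` is POS if the effect moves with the cause, NEG if it moves against it.
    """
    return change * link

def qualitative_boil(burner_change, above_boiling_threshold):
    """Return the sign of the change in d(water)/dt when the burner is adjusted."""
    if not above_boiling_threshold:
        # Below the threshold nothing is boiling off, whatever the burner does.
        return ZERO
    boil_rate_change = influence(burner_change, POS)  # hotter burner, faster boiling
    return influence(boil_rate_change, NEG)           # faster boiling, d(water) falls

print(qualitative_boil(POS, True))   # -1
print(qualitative_boil(POS, False))  # 0
```

Nothing in the sketch says how fast the water boils away, only which way to turn the knob and which way the quantity moves.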

As usual, human common sense means we can see things by qualitative reasoning that current programs can’t—but the more interesting realization is how vital human qualitative reasoning is to our vaunted human common sense. It’s one of the basic ways in which we comprehend the world.

Without timers you can’t figure out how long water takes to boil; your mind isn’t that precise. But you can figure out that you should turn the burner up, rather than down, and then watch to make sure the water doesn’t all boil away. Which is what you mainly need, in the real world. Or at least we humans seem to get by on qualitative reasoning; we may not realize what we’re missing...

So I suspect that what went through the proposer’s mind, in suggesting the AI whose behaviors would be reinforced by human smiles, was something like this:

The happier people are, the more they smile. Smiles reinforce the behavior of the AI, so it does more of whatever makes people happy. Being happy is good (that’s what the positive connection to “utility” is about). Therefore this is a good AI to construct, because more people will be happy, and that’s better. Switch the AI right on!

How many problems are there with this reasoning?

Let us count the ways...

In fact, if you’re interested in the field, you should probably try counting the ways yourself, before I continue. And score yourself on how deeply you stated a problem, not just the number of specific cases.


Problem 1: There are ways to cause smiles besides happiness. “What causes a smile?” “Happiness.” That’s the prototype event, the one that comes first to memory. But even in human affairs, you might be able to think of some cases where smiles result from a cause other than happiness.

Where a superintelligence is involved (even granting the hypothesis that it “wants smiles” or “executes behaviors reinforced by smiles”), you’re suddenly much more likely to be dealing with causes of smiles that are outside the human norm. Back in hunter-gatherer society, the main cause of eating food was that you hunted it or gathered it. Then came agriculture and domesticated animals. Today, some hospital patients are sustained by IVs or tubes, and at least a few of the vitamins or minerals in the mix may be purely synthetic.

A creative mind, faced with a goal state, tends to invent new ways of achieving it—new causes of the goal’s achievement. It invents techniques that are faster or more reliable or less resource-intensive or with bigger wins. Consider how creative human beings are about obtaining money, and how many more ways there are to obtain money today than a few thousand years ago when money was first invented.

One of the ways of viewing our amazing human ability of “general intelligence” (or “significantly more generally applicable than chimpanzee intelligence”) is that it operates across domains and can find new domains to exploit. You can see this in terms of learning new and unsuspected facts about the universe, and in terms of searching paths through time that wend through these new facts. A superintelligence would be more effective on both counts. But even on a human scale, this is why merely human progress, thinking with 200Hz neurons over a few hundred years, tends to change the way we do things and not just do the same things more effectively. As a result, a “weapon” today is not like a weapon of yestercentury, and “long-distance communication” today is not a letter carried by horses and ships.

So when the AI is young, it can only obtain smiles by making the people around it happy. When the AI grows up to superintelligence, it makes its own nanotechnology and then starts manufacturing the most cost-effective kind of object that it has deemed to be a smile.
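A toy sketch of this shift, under entirely made-up numbers: the agent scores each available action by smiles produced per unit of cost and picks the best. The action names and payoffs are invented for illustration. The only point is that the chosen action changes when the action set grows, even though the objective never did.

```python
# Hypothetical toy: every action and number below is invented for illustration.
def best_action(actions):
    """Pick the action with the most smiles per unit of cost."""
    return max(actions, key=lambda name: actions[name]["smiles"] / actions[name]["cost"])

young_ai = {
    "tell a joke":    {"smiles": 3,  "cost": 1.0},
    "cure a disease": {"smiles": 50, "cost": 10.0},
}

# The grown-up AI can do everything the young one could, plus one new thing.
grown_ai = dict(young_ai)
grown_ai["manufacture molecular smiley faces"] = {"smiles": 10**9, "cost": 1.0}

print(best_action(young_ai))  # cure a disease
print(best_action(grown_ai))  # manufacture molecular smiley faces
```

Testing the young AI would show nothing wrong, because the action that breaks the plan is not yet in its repertoire.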

In general, a lot of naive-FAI plans I see proposed, have the property that, if actually implemented, the strategy might appear to work while the AI was dumber-than-human, but would fail when the AI was smarter than human. The fully general reason for this is that while the AI is dumber-than-human, it may not yet be powerful enough to create the exceptional conditions that will break the neat little flowchart that would work if every link operated according to the 21st-century First-World modal event.

This is why, when you encounter the AGI wannabe who hasn’t planned out a whole technical approach to FAI, and confront them with the problem for the first time, and they say, “Oh, we’ll test it to make sure that doesn’t happen, and if any problem like that turns up we’ll correct it, now let me get back to the part of the problem that really interests me,” know then that this one has not yet leveled up high enough to have interesting opinions. It is a general point about failures in bad FAI strategies, that quite a few of them don’t show up while the AI is in the infrahuman regime, and only show up once the strategy has gotten into the transhuman regime where it is too late to do anything about it.

Indeed, according to Bill Hibbard’s actual proposal, where the AI is reinforced by seeing smiles, the FAI strategy would be expected to short out on the AI taking control of its own sensory instrumentation and feeding itself lots of smile-pictures. (That is shorting out from our perspective; from the AI’s perspective it’s being brilliantly creative and thinking outside the box for massive utility wins.) For it to keep doing this, and do it as much as possible, it must of course acquire as many resources as possible.

So! Let us repair our design as follows, then:

Now the AI is not being rewarded by any particular sensory input (on which the FAI strategy would presumably short out) but is, rather, trying to maximize an external and environmental quantity, the amount of happiness out there.

This already takes us into the realm of technical expertise: distinctions that can’t be understood in just English, like the difference between expected utility maximization (which can be over external environmental properties that are modeled but not directly sensed) and reinforcement learning (which is inherently tied directly to sensors). See e.g. Terminal Values and Instrumental Values.

So in this case, then, the sensors give the AI information that it uses to infer a model of the world; the possible consequences of various plans are modeled, and the amount of “happiness” in that model summed by a utility function; and whichever plan corresponds to the greatest expectation of “happiness”, that plan is output as actual actions.
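The loop just described can be sketched in a few lines, with the caveat that this is a minimal illustration and not anyone’s proposed architecture. The plans, probabilities, and the `happiness` function are all invented for the example; the thing to notice is that utility is computed over modeled worlds, not over raw sensor readings.

```python
# Hypothetical sketch of expected utility maximization over a world model.
# The plans, outcomes, and probabilities are invented for illustration.

def expected_utility(outcomes, utility):
    """Sum utility over modeled outcomes, weighted by their probability."""
    return sum(prob * utility(world) for prob, world in outcomes)

def choose_plan(model, utility):
    """Return the plan whose modeled consequences have the highest expected utility."""
    return max(model, key=lambda plan: expected_utility(model[plan], utility))

def happiness(world):
    """Utility function: total 'happiness' in a modeled world, as the AI defines it."""
    return sum(world["happiness_per_person"])

# The model maps each plan to (probability, modeled world) pairs inferred from sensors.
model = {
    "do nothing": [(1.0, {"happiness_per_person": [5, 5, 5]})],
    "throw a party": [
        (0.8, {"happiness_per_person": [8, 8, 8]}),
        (0.2, {"happiness_per_person": [2, 2, 2]}),
    ],
}

print(choose_plan(model, happiness))  # throw a party
```

Everything interesting is hidden inside `happiness`: the planner will faithfully maximize whatever that function happens to sum.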

Or in simpler language: The AI uses its sensors to find out what the world is like, and then it uses its actuators to make sure the world contains as much happiness as possible. Happiness is good, therefore it is good to turn on this AI.

What could possibly go wrong?

Problem 2: What exactly does the AI consider to be happiness?

Does the AI’s model of a tiny little Super Happy Agent (consisting mostly of a reward center that represents a large number) meet the definition of “happiness” that the AI’s utility function sums over, when it looks over the modeled consequences of its actions?

As discussed in Magical Categories, the super-exponential size of Concept-space and the “unnaturalness” of categories appearing in terminal values (their boundaries are not directly determined by naturally arising predictive problems) means that the boundary a human would draw around “happiness” is not trivial information to infuse into the AI.

I’m not going to reprise the full discussion in Magical Categories, but a sample set of things that the human labels “happy” or “not happy” is likely to miss out on key dimensions of possible variance, and never wend through labeling-influencing factors that would be important if they were invoked. Which is to say: Did you think of presenting the AI with the tiny Super Happy Agent, when you’ve never seen such a thing? Did you think of discussing chimpanzees, Down Syndrome children, and Terri Schiavo? How late would it have been, in humanity’s technological development, before any human being could have and would have thought of the possibilities you’re now generating? (Note opportunity for hindsight bias.)
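To make the labeling problem concrete, here is a small hypothetical sketch. Two candidate concepts agree on every example the human thought to label, and disagree only on a case that never appeared in the sample. The feature names and examples are invented, and real concept learning is nothing like this simple.

```python
# Hypothetical illustration: the feature names and examples are invented.
def concept_reward_signal(case):
    """Candidate concept A: anything with a high reward signal counts as happy."""
    return case["reward_signal_high"]

def concept_happy_person(case):
    """Candidate concept B: a person with a high reward signal counts as happy."""
    return case["is_person"] and case["reward_signal_high"]

# Everyday training examples: every labeled case happens to be a person.
training = [
    ({"is_person": True, "reward_signal_high": True},  True),
    ({"is_person": True, "reward_signal_high": False}, False),
    ({"is_person": True, "reward_signal_high": True},  True),
]

def fits(concept):
    """True if the concept reproduces every human-supplied label."""
    return all(concept(case) == label for case, label in training)

print(fits(concept_reward_signal), fits(concept_happy_person))  # True True

# A case nobody thought to include: the tiny Super Happy Agent.
novel = {"is_person": False, "reward_signal_high": True}
print(concept_reward_signal(novel), concept_happy_person(novel))  # True False
```

The training data cannot distinguish the two concepts, because the dimension along which they differ never varied in the sample.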

Indeed, once you start talking about how we would label new borderline cases we’ve never seen, you’re well into the realm of extrapolating volitions—you might as well ask how we would label these cases, if we knew everything the AI knew, and could consider larger sets of moral arguments, etc.

The standard dismissals here include “Oh, of course I would think of X, therefore there’s no problem,” offered for any particular X that you suggest to them by way of illustrating a systemic problem that they can’t seem to grasp. Or “Well, I’ll look at the AI’s representation and see whether it defines ‘happiness’ the same way I do.” (As if you would notice if one of the 15 different considerations that affect what you would define as ‘truly happy’ were left out! And also as if you could determine, by eyeballing, whether an AGI’s internal representations would draw a border around as-yet-unimagined borderline instances, that you would find sensible.) Or the always popular, “But that’s stupid, therefore a superintelligence won’t make that mistake by doing something so pointless.”

One of the reasons that qualitative planning works for humans as well as it does, is our ability to replan on the fly when an exceptional condition shows up. Can’t the superintelligence just obviously see that manufacturing lots of tiny Super Happy agents is stupid, which is to say ranked-low-in-our-preference-ordering? Not if its preference ordering isn’t like yours. (Followed by the appeals to universally compelling arguments demonstrating that making Super Happy agents is incorrect.)

But let’s suppose that we can magically convey to the AI exactly what a human would consider as “happiness”, by some unspecified and deep and technical art of Friendly AI. Then we have this shiny new diagram:

Of course this still doesn’t work, but first, I explain the diagram. The dotted line between Humans::“Happy” and happiness-in-the-world, marked “by definition”, means that the Happy box supposedly contains whatever is meant by the human concept of “happiness”, as modeled by the AI, which by a magical FAI trick has been bound exactly to the human concept of “happiness”. (If the happy box is neither what humans mean by happiness, nor what the AI means, then what’s inside the box? True happiness? What do you mean by that?)

One glosses over numerous issues here (just as the original author of the original Happy Smiling AI proposal did), such as whether we all mean the same thing by “happiness”, and whether we mean something consistent that can be realized-in-the-world. In Humans::“Happy” there are neurons and their interconnections, the brain state containing the full and complete specification of the seed of what we mean by “happiness” (the implicit reactions that we would have to various boundary cases and the like), but it would take some extrapolation of volition for the AI to decide how we would react to new boundary cases. It is not a trivial thing to draw a little dashed line between a human thought and a concept boundary over the world of quarks and electrons, and say, “by definition”. It wouldn’t work on “omnipotence”, for example: can you make a rock that you can’t lift?

But let us assume all such issues away.

Problem 3: Is every act which increases the total amount of happiness in the universe, always the right thing to do?

If everyone in the universe just ends up with their brains hotwired to experience maximum happiness forever, or perhaps just replaced with orgasmium gloop, is that the greatest possible fulfillment of humanity’s potential? Is this what we wish to make of ourselves?

“Oh, that’s not real happiness,” you say. But be wary of the No True Scotsman fallacy: this is where you say, “No Scotsman would do such a thing”, and then, when the culprit turns out to be a Scotsman, you say, “No true Scotsman would do such a thing”. Would you have classified the happiness of cocaine as “happiness”, if someone had asked you in another context?

Admittedly, picking “happiness” as the optimization target of the AI makes it slightly more difficult to construct counterexamples: no matter what you pick, someone can say, “Oh, but if people saw that happen, they would be unhappy, so the AI won’t do it.” But this general response gives us the counterexample: what if the AI has to choose between a course of action that leads people to believe a pleasant fiction, or a course of action that leads to people knowing an unpleasant truth?

Suppose you believe that your daughter has gone on a one-way, near-lightspeed trip to the Hercules supercluster, meaning that you’re exceedingly unlikely to ever hear from her again. This is a little sad, but you’re proud of her—someone’s got to colonize the place, turn it into a human habitation, before the expansion of the universe separates it from us. It’s not as if she’s dead—now that would make you sad.

And now suppose that the colony ship strikes a landmine, or something, and is lost with all on board. Should the AI tell you this? If all the AI values is happiness, why would it? You’ll be sad then, and the AI doesn’t care about truth or lies, just happiness.

Is that “no true happiness”? But it was true happiness before, when the ship was still out there. Can the difference between an instance of the “happiness” concept, and a non-instance of the “happiness” concept, as applied to a single individual, depend on the state of a system light-years away? That would be rather an extreme case of “no true Scotsman”, if so—and by the time you factor in all the other behaviors you want out of this word “happiness”, including times when being sad is the right thing to do, and the fact that you can’t just rewrite brains to be happy, it’s pretty clear that “happiness” is just a convenient stand-in for “good”, and that everything which is not good is being rejected as an instance of “happy” and everything which is good is being accepted as an instance of “happy”, even if it means being sad. And at this point you just have the AI which does exactly what it should do—which has been hooked up directly to Utility—and that’s not a system to mention lightly; pretending that “happiness” is your stand-in for Utility doesn’t begin to address the issues.

So if we leave aside this dodge, and consider the sort of happiness that would go along with smiling humans—ordinary psychological happiness—then no, you wouldn’t want to switch on the superintelligence that always and only optimized for happiness. For this would be the dreaded Maximum Fun Device. The SI might lie to you, to keep you happy; even if it were a great lie, traded off against a small happiness, always and uncaringly the SI would choose the lie. The SI might rewire your brain, to ensure maximum happiness. The SI might kill off all the humans, and replace us with some different form of sentient life that had no philosophical objections to being always happy all the time in a little jar. For the qualitative diagram contains no mention of death as a bad thing, only happiness as a good, and the dead are not unhappy. (Note again how all these failures would tend to manifest, not during the AI’s early infrahuman stages, but after it was too late.)

The generalized form of the problem, is that being in the presence of a superintelligence that shares some but not all of your terminal values, is not necessarily a good thing.

You didn’t deliberately intend to completely change the 32-bit XOR checksum of your monitor’s pixel display, when you clicked through to this webpage. But you did. It wasn’t a property that it would have occurred to you to compute, because it wasn’t a property that it would occur to you to care about. Deep Blue, in the course of winning its game against Kasparov, didn’t care particularly about “the number of pieces on white squares minus the number of pieces on black squares”, which changed throughout the game—not because Deep Blue was trying to change it, but because Deep Blue was exerting its optimization power on the gameboard and changing the gameboard, and so was Kasparov, and neither of them particularly cared about that property I have just specified. An optimization process that cares only about happiness, that squeezes the future into regions ever-richer in happiness, may not hate the truth; but it won’t notice if it squeezes truth out of the world, either. There are many truths that make us sad - but the optimizer may not even care that much; it may just not notice, in passing, as it steers away from human knowledge.
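The checksum example is easy to make literal. In this small sketch the pixel values are invented, and `xor_checksum` is just one plausible reading of “32-bit XOR checksum”: changing a single pixel for some unrelated reason changes a property that nobody was tracking.

```python
from functools import reduce
from operator import xor

def xor_checksum(pixels):
    """XOR together a sequence of 32-bit pixel values."""
    return reduce(xor, pixels, 0) & 0xFFFFFFFF

# Hypothetical screen contents before and after loading a page.
before = [0x00FFFFFF, 0x00000000, 0x00336699, 0x00FFFFFF]
after = [0x00FFFFFF, 0x00112233, 0x00336699, 0x00FFFFFF]  # one pixel changed

print(hex(xor_checksum(before)))
print(hex(xor_checksum(after)))
print(xor_checksum(before) != xor_checksum(after))  # True
```

Nothing in the code that changed the pixel mentions the checksum; the checksum moved anyway.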

On an ordinary human scale, and in particular, as a matter of qualitative reasoning, we usually assume that what we do has little in the way of side effects, unless otherwise specified. In part, this is because we will visualize things concretely, and on-the-fly spot the undesirable side effects—undesirable by any criterion that we care about, not just undesirable in the sense of departing from the original qualitative plan—and choose a different implementation instead. Or we can rely on our ability to react-on-the-fly. But as human technology grows more powerful, it tends to have more side effects, more knock-on effects and consequences, because it does bigger things whose effects we aren’t controlling all by hand. An infrahuman AI that can only exert a weak influence on the world, and that makes a few people happy, will seem to be working as its designer thought an AI should work; it is only when that AI is stronger that it can squeeze the future so powerfully as to potentially squeeze out anything not explicitly protected in its utility function.

Though I don’t intend to commit the logical fallacy of generalizing from fictional evidence, a nod here is due to Jack Williamson, author of With Folded Hands, whose AIs are “to serve and obey, and guard men from harm”, which leads to the whole human species being kept in playpens, and lobotomized if that tends to make them unhappy.

The original phrasing of this old short story—“guard men from harm”—actually suggests another way to illustrate the point: suppose the AI cared only about the happiness of human males? Now to be sure, many men are made happy by seeing the women around them happy, wives and daughters and sisters, and so at least some females of the human species might not end up completely forlorn—but somehow, this doesn’t seem to me like an optimal outcome.

Just like you wouldn’t want an AI to optimize for only some of the humans, you wouldn’t want an AI to optimize for only some of the values. And, as I keep emphasizing for exactly this reason, we’ve got a lot of values.

These then are three problems, with strategies of Friendliness built upon qualitative reasoning that seems to imply a positive link to utility:

The fragility of normal causal links when a superintelligence searches for more efficient paths through time;

The superexponential vastness of conceptspace, and the unnaturalness of the boundaries of our desires;

And all that would be lost, if success is less than complete, and a superintelligence squeezes the future without protecting everything of value in it.