Consequentialists: One-Way Pattern Traps

Generated during MATS 2.1.

A distillation of my understanding of Eliezer-consequentialism.

Thanks to Jeremy Gillen, Ben Goodman, Paul Colognese, Daniel Kokotajlo, Scott Viteri, Peter Barnett, Garrett Baker, and Olivia Jimenez for discussion and/​or feedback; to Eliezer Yudkowsky for briefly chatting about relevant bits in planecrash; to Quintin Pope for causally significant conversation;[1] and to many others that I’ve bounced my thoughts on this topic off of.

Introduction

What is Eliezer-consequentialism? In a nutshell, I think it’s the way that some physical structures monotonically accumulate patterns in the world. Some of these patterns afford influence over other patterns, and some physical structures monotonically accumulate patterns-that-matter in particular—resources. We call such a resource accumulator a consequentialist—or, equivalently, an “agent,” an “intelligence,” etc.

A consequentialist understood in this way is (1) a coherent profile of reflexes (a set of behavioral reflexes that together monotonically take in resources) plus (2) an inventory (some place where accumulated resources can be stored with better-than-background-chance reliability).

Note that an Eliezer-consequentialist is not necessarily a consequentialist in the normative ethics sense of the term. By consequentialists we’ll just mean agents, including wholly amoral agents. I’ll freely use the terms ‘consequentialism’ and ‘consequentialist’ henceforth with this meaning, without fretting any more about this confusion.

Path to Impact

I noticed, hanging around the MATS London office, that even full-time alignment researchers disagree quite a bit about what consequentialism involves. I’m betting here that my Eliezer-model is good enough that I’ve understood his ideas on the topic better than many others have, and that I can concisely communicate this better understanding.

Since most of the possible positive impact of this effort lives in the fat tail of outcomes where it makes a lot of Eliezerisms click for a lot of alignment workers, I’ll make this an effortpost.

The Ideas to be Clarified

I’ve noticed that Eliezer seems to think the von Neumann-Morgenstern (VNM) theorem is obviously far-reaching in a way that few others do.

Understand the concept of VNM rationality, which I recommend learning from the Wikipedia article… Von Neumann and Morgenstern showed that any agent obeying a few simple consistency axioms acts with preferences characterizable by a utility function.

--MIRI Research Guide (2015)

Can you explain a little more what you mean by “have different parts of your thoughts work well together”? Is this something like the capacity for metacognition; or the global workspace; or self-control; or...?

No, it’s like when you don’t, like, pay five apples for something on Monday, sell it for two oranges on Tuesday, and then trade an orange for an apple.

I have still not figured out the homework exercises to convey to somebody the Word of Power which is “coherence” by which they will be able to look at the water, and see “coherence” in places like a cat walking across the room without tripping over itself.

When you do lots of reasoning about arithmetic correctly, without making a misstep, that long chain of thoughts with many different pieces diverging and ultimately converging, ends up making some statement that is… still true and still about numbers! Wow! How do so many different thoughts add up to having this property? Wouldn’t they wander off and end up being about tribal politics instead, like on the Internet?

And one way you could look at this, is that even though all these thoughts are taking place in a bounded mind, they are shadows of a higher unbounded structure which is the model identified by the Peano axioms; all the things being said are true about the numbers. Even though somebody who was missing the point would at once object that the human contained no mechanism to evaluate each of their statements against all of the numbers, so obviously no human could ever contain a mechanism like that, so obviously you can’t explain their success by saying that each of their statements was true about the same topic of the numbers, because what could possibly implement that mechanism which (in the person’s narrow imagination) is The One Way to implement that structure, which humans don’t have?

But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there’s no internal mechanism that enforces the global coherence at every point.

To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I’m talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying “look here”, even though people have occasionally looked for alternatives.

And when I try to say this, people are like, “Well, I looked up a theorem, and it talked about being able to identify a unique utility function from an infinite number of choices, but if we don’t have an infinite number of choices, we can’t identify the utility function, so what relevance does this have” and this is a kind of mistake I don’t remember even coming close to making so I do not know how to make people stop doing that and maybe I can’t.

--(Richard and) Eliezer, Ngo and Yudkowsky on alignment difficulty (2021)

In what follows I’ll try to explicate that hinted-at far-reachingness of the math of the VNM theorem. Most of the bits in my model are sourced from the above excerpt, plus the relevant parts of Eliezer’s broader corpus.
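To make the apples-and-oranges arithmetic concrete, here is a minimal sketch of my own (the specific trades are the ones from the quote; the starting inventory and function name are made up) showing that an agent with those locally inconsistent exchange rates bleeds apples every time it cycles through them:

```python
# A toy money pump, following the quote above: buy a trinket for five apples,
# sell it for two oranges, trade the two oranges back for four apples. Each trip
# around the loop leaves the agent one apple poorer, with nothing to show for it.

def run_money_pump(apples: int, cycles: int) -> int:
    for _ in range(cycles):
        apples -= 5            # Monday: pay five apples for the trinket
        oranges = 2            # Tuesday: sell the trinket for two oranges
        apples += 2 * oranges  # then trade the two oranges for four apples
    return apples

print(run_money_pump(apples=100, cycles=10))  # 90: one apple lost per cycle
```

The coherence theorems are, roughly, about which families of local valuations can and cannot be cycled like this.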

The Main Reason People Have Not Seen VNM as Far-Reaching

The main reason people haven’t thought much of the VNM theorem is that the theorem is trivially satisfiable for any physical object. That is, the theorem apparently proves too much. If any system can be modeled as optimizing a utility function, then VNM cannot be the mathematical true name of “agency.”

In more detail, the trivial satisfiability issue arises because in VNM, there is no requirement that utility functions be simple. So, despite a system exhibiting intuitively non-agentic behavior, we can nevertheless construct a utility function for that system.

Think of a system as outputting one ‘behavior’ at every timestep. Then, looking back over that system’s historical behavior, consider the function sending every past behavior of the system at each timestep to 1 (and every other possible behavior at each timestep to 0, say). Even a rock can thereby be thought of as optimizing for doing exactly what that rock historically did and nothing else.
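As a minimal sketch of that construction (the names here are mine, purely for illustration):

```python
# The trivial utility function: assign 1 to the exact behavioral history the
# system in fact produced, and 0 to every alternative history. Any system,
# including a rock, trivially counts as "maximizing" the function built this way.

def trivial_utility(observed_history: tuple):
    def utility(candidate_history: tuple) -> int:
        return 1 if candidate_history == observed_history else 0
    return utility

rock_history = ("sit", "sit", "fall over", "sit")      # what the rock in fact did
u_rock = trivial_utility(rock_history)

print(u_rock(rock_history))                            # 1
print(u_rock(("sit", "roll uphill", "sit", "sit")))    # 0
```

Note that this utility function is exactly as complicated as the full behavioral history it memorizes, which is the loophole the simplicity-weighting discussed below is meant to close.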

Being Well Modeled as Optimizing a Utility Function

Take a cat. Suppose that the firings of each little patch of neurons in the cat’s brain are uncorrelated with the firings of its other neurons some distance away. The left top of the cat’s brain is not sensitive to the right bottom of the cat’s brain, and vice versa. Looking at firing patterns in one patch of its brain will not give you information about the unobserved firings in a separate patch of its brain.

This cat… is having a seizure (or something of the sort). Because the inputs into the cat’s front-right and back-left legs, say, aren’t correlated with one another, the cat isn’t walking step after step in a straight line. Indeed, if all its neural regions are just going off independently, without firing in concert, the cat cannot be doing anything interesting at all! If you observe some other cat out in the neighborhood stalking mice, chalking that hunting behavior up to entropy would be an exceptionally poor explanation of the hunting cat’s brain. Entropy does not catch mice; coordinated neural structure must be present behind the scenes of interesting behavior.

Dan Dennett famously distinguishes between structures that we best model as inanimate objects and structures we best model as agents. In the former case, we use a suite of folk-physical heuristics to think about what possessions left sitting unobserved in our apartment are currently doing. In the latter case, we instead natively think in terms of means, motive, and opportunity—fundamentally agentic heuristics. In reality, our universe is just one great big computation, bound tightly together, that doesn’t use separate rules for computing its inanimate objects and its agents. But humans’ computational overheads are real features of the world too. For humans, epistemically modeling the great big world depends on two pretty good kinds of approximation to the truth.

You can always try to see a rock as an agent—no one will arrest you. But that lens doesn’t accurately predict much about what the inanimate object will do next. Rocks like to sit inert and fall down, when they can; but they don’t get mad, or have a conference to travel to later this month, or get excited to chase squirrels. Most of the cognitive machinery you have for predicting the scheming of agents lies entirely fallow when applied to rocks.

On the other hand, dogs are indeed entirely physical objects, and you can try to understand them as inanimate objects, the way you understand unattended glasses on a table. Does this frame predict the dog will bark at the mailman? It certainly predicts that a dog will have ragdoll physics, but it’s clueless about the dog needing to take a crap once a day. Our inanimate object heuristics aren’t empirically adequate here. A Laplacean demon would have no problem here using the inanimate objects frame… but for us, with our overhead constraints, this isn’t so.

The difference between the rock and the dog is only that one is well modeled by us as an agent and the other as an inanimate object. Both heuristic frames are leaky—the true ontology of our universe is mathematical physics, not intuitive physics—but the patterns those schemes successfully anticipate in everyday human life are actual successes in correctly expecting future observations. Bear this two-place function—from an object to be anticipated and a computational budget to a predictive score—in mind when thinking about what’s an agent and what isn’t.
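To pin that two-place function down a little (a hypothetical sketch of mine; the frame names, scores, and FLOP thresholds are all invented for illustration), the idea is that which frame "wins" depends on both the object and the modeler’s compute budget:

```python
# Toy version of the two-place function: (object, budget) -> which frame predicts
# best. A human-sized budget does better treating dogs as agents; an effectively
# unbounded (Laplacean-demon) budget can just use physics for everything.

def predictive_score(obj: str, frame: str, budget_flops: float) -> float:
    if budget_flops > 1e40:  # effectively unbounded modeler
        return 1.0 if frame == "inanimate-physics" else 0.9
    toy_table = {("rock", "inanimate-physics"): 0.95, ("rock", "agent"): 0.10,
                 ("dog", "inanimate-physics"): 0.20, ("dog", "agent"): 0.85}
    return toy_table.get((obj, frame), 0.0)

def best_frame(obj: str, budget_flops: float) -> str:
    return max(["inanimate-physics", "agent"],
               key=lambda f: predictive_score(obj, f, budget_flops))

print(best_frame("dog", budget_flops=1e15))  # 'agent' for a human-sized budget
print(best_frame("dog", budget_flops=1e50))  # 'inanimate-physics' for a demon
```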

How Worrisome is a “Model Aptness” Free Parameter in a Theory?

Say, then, that we weight utility functions by some notion of simplicity, and say that something is an agent insofar as its behavior is anticipated by a simplicity-weighted utility function. What does nature have to say about this? Is this a theoretically reasonable use of VNM… or a misguided epicycle atop a theory nature is trying to warn us off of?
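Written out a bit more explicitly (my own gloss and notation, nothing canonical), the proposal is something like:

```latex
% One possible formalization, assuming some complexity measure K and tradeoff weight \lambda:
\mathrm{agency}(S) \;\approx\; \max_{U}\Big[\, \mathrm{fit}\big(U,\ \text{behavior of } S\big) \;-\; \lambda \, K(U) \,\Big]
```

where K(U) is some description-length penalty and λ trades fit off against simplicity. A rock then scores low: the only utility functions that fit its behavior well are the trivial, history-memorizing ones, which pay a large K(U).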

One reason to think nature is okaying this move: other similar, scientifically useful formalisms have “reasonableness” or “aptness” free parameters. I think that nature is thereby telling us not to be too obstinate about this.

Take Solomonoff induction. Solomonoff inductors take as input a “reasonable” Universal Turing Machine, whatever “reasonable” means exactly:

ASHLEY: My next question is about the choice of Universal Turing Machine—the choice of compiler for our program codes. There’s an infinite number of possibilities there, and in principle, the right choice of compiler can make our probability for the next thing we’ll see be anything we like. At least I’d expect this to be the case, based on how the “problem of induction” usually goes. So with the right choice of Universal Turing Machine, our online crackpot can still make it be the case that Solomonoff induction predicts Canada invading the USA.

BLAINE: One way of looking at the problem of good epistemology, I’d say, is that the job of a good epistemology is not to make it impossible to err. You can still blow off your foot if you really insist on pointing the shotgun at your foot and pulling the trigger.

The job of good epistemology is to make it more obvious when you’re about to blow your own foot off with a shotgun. On this dimension, Solomonoff induction excels. If you claim that we ought to pick an enormously complicated compiler to encode our hypotheses, in order to make the ‘simplest hypothesis that fits the evidence’ be one that predicts Canada invading the USA, then it should be obvious to everyone except you that you are in the process of screwing up.

ASHLEY: Ah, but of course they’ll say that their code is just the simple and natural choice of Universal Turing Machine, because they’ll exhibit a meta-UTM which outputs that UTM given only a short code. And if you say the meta-UTM is complicated—

BLAINE: Flon’s Law says, “There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code.” You can’t make it impossible for people to screw up, but you can make it more obvious.

Closely relatedly, Bayesianism requires one to have a “reasonable prior” in order for accurate hypotheses to rise to the top. There exist suitably monstrous priors for which a Bayesian won’t converge to accurate hypotheses, despite being beaten over the head with empirical evidence to the contrary. But we don’t worry too much about this—it’s easy for us, if we’re not too theoretically stubborn about it, to provide any number of suitable priors on which Bayesianism works wonderfully.
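As a toy illustration (a sketch of mine with made-up numbers, not anything from the post): give a Bayesian a sensible prior over two coin hypotheses and fifty-fifty data settles the matter almost immediately; give it a monstrous prior that all but rules out the true hypothesis, and the same data barely dents it.

```python
# Bayesian updating on coin flips: "fair" (p=0.5) vs "biased" (p=0.9). A sensible
# prior converges right away; a prior that nearly rules out the truth takes far
# longer to be beaten into shape by the same evidence.

def posterior_on_fair(prior_fair: float, heads: int, tails: int) -> float:
    like_fair = 0.5 ** (heads + tails)
    like_biased = (0.9 ** heads) * (0.1 ** tails)
    num = prior_fair * like_fair
    return num / (num + (1 - prior_fair) * like_biased)

# Data generated by a genuinely fair coin: 50 heads, 50 tails.
print(posterior_on_fair(prior_fair=0.5, heads=50, tails=50))    # ~1.0: converged
print(posterior_on_fair(prior_fair=1e-30, heads=50, tails=50))  # ~1e-8: barely moved
```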

The situation with trivial utility functions for rocks is similar. The fact that a rock isn’t well predicted by any simple[2] utility function is nature’s way of telling us that the rock isn’t an agent (or, is at most only negligibly an agent).

Presentation of Consequentialism via Analogy to the Thermodynamic Definition of Life

The thermodynamic definition of life is that living things are pools of negentropy that remain stable in the face of outside disorder. That is, a bird possesses a metabolism, and uses its metabolism to convert ordered structure in the outside world (e.g., seeds) into more bird and to shunt disorder out of itself (as body heat and waste).[3]

This is a very embedded-agency-flavored theory of life. It gives us, taking as input only the great big computational brick that is our universe, a way to pinpoint the organisms embedded in that computation! Moreover, it suggests a principled notion of “inside of the organism” and “outside of it.” Where an organism’s membrane falls isn’t an arbitrary subjective boundary for an outside observer to draw. An organism’s membrane is that boundary with the order on the inside and the disorder robustly kept outside.

I want to begin with this account of life and generalize it a bit, to arbitrary (not necessarily biological) agents.

We Call The Patterns That Are The Commanding Heights, “Resources”

The trouble with a resource-based framing of consequentialism is that “one man’s trash is another man’s treasure.” Different agents terminally value different things, period, and so they also instrumentally value different things. This means that understanding agents as “resource traps” is unenlightening, if resources are just meant to be those things that agents find valuable.

“It’s no accident that all animals eat food.”[4]

Crucially, though, the universe is quite opinionated about which patterns matter more and less! Influence over some patterns, like high-tech manufacturing, comes bundled for free with influence over many other patterns, such as adaptability inside harsh environments and/​or your probes getting to survey Jupiter. Channeling idiosyncratic interpretative dance technique through your every motion doesn’t lend leverage over other parts of the world. If one kind of “pattern trap” accumulates the high-tech manufacturing plants and another pattern trap accumulates the idiosyncratic dance steps, it’s the former that shapes the future lightcone.

Pattern traps are vortices in our universe that tend to collect patterns of some description. Consequentialists are the pattern traps that accumulated the patterns-that-matter.

Principled Cartesian Boundaries Keep the Good Stuff Inside

Four billion years ago, ‘RNA world’ life first emerged on Earth. Those early replicators didn’t assemble themselves in the early seas by sheer chance, however. They budded inside bubbles in porous rock, where some ions could be locked inside and others reliably kept outside (or so one theory goes, anyways). What was going on at the skin of that bubble, what outward-facing “ion turnstiles” were installed there across the membrane, had everything to do with where exactly the first viable prokaryote formed. Everything was about preferentially trapping the good stuff and not the bad stuff inside, as disorganized matter floated in from just outside.

If you have some discordant turnstile down towards the bottom of the bubble, that weak link will be enough to pump out lots of the good stuff the rest of your turnstiles are working so hard to keep inside! Evolution would have favored behavioral coherence among the ion pumps on a membrane.[5]

Analogously, say you started a world off with a bunch of simple critters with randomly installed sets of reflexes. Each of these installed reflexes is a simple mechanism, sphexishly executing its subroutine when its sensory trigger appears nearby. If a critter sees a blue fruit, one reflex might grab it and eat it, or grab it and hide it, or run away from it, or jump on it...

Most sets of reflexes don’t work together to do anything interesting. An animal that’s compelled to both put things into its mouth and to spit out anything in its mouth won’t eat well. Only when all the reflexes in the animal are serendipitously running in global coherence does interesting aggregate behavior result. Some patterns in the outside world can be, for example, collected and hidden away somewhere safe, or eaten ASAP. If the universe imposes some caloric tax on living things, globally coherent reflexes will allow a critter to pay the tax. If the universe allows you to make trades, locally expending calories now in exchange for a big pile of calories you might hunt down, having a set of reflexes that all coherently work to value calories at the same rate will make for a successful hunter.
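One toy way to picture “valuing calories at the same rate” (my sketch; the goods and rates are invented): write down each reflex’s implicit exchange rate between goods, and check whether chaining rates around a loop manufactures something from nothing. A coherent table can’t be pumped; an incoherent one can, and the critter pays for it.

```python
# Toy sketch: reflexes as exchange rates between goods. If the rates around some
# trading loop multiply to more than 1, the critter's own reflexes can be chained
# into a pump that bleeds it of resources.

from itertools import permutations

def has_pump(rates: dict[tuple[str, str], float]) -> bool:
    """Check all 3-step loops for a cycle whose rates multiply to more than 1."""
    goods = {g for pair in rates for g in pair}
    for a, b, c in permutations(goods, 3):
        loop = rates.get((a, b), 0) * rates.get((b, c), 0) * rates.get((c, a), 0)
        if loop > 1.001:  # small tolerance for float noise
            return True
    return False

coherent = {("cal", "fruit"): 0.5, ("fruit", "effort"): 2.0, ("effort", "cal"): 1.0}
incoherent = {("cal", "fruit"): 0.5, ("fruit", "effort"): 4.0, ("effort", "cal"): 1.0}

print(has_pump(coherent))    # False: the loop multiplies to exactly 1
print(has_pump(incoherent))  # True: the loop multiplies to 2, so it can be pumped
```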

After evolution rolls on for a bit, the surviving critters all have boundaries, principled Cartesian skins keeping good stuff inside and hazards outside. It’s not just that you can choose to see the universe as partitioned between the organism and its outside world. It’s that there are all these globally coherent reflexes of the organism lying exactly at that particular possible Cartesian boundary. Agents are bubbles of reflexes, when those reflexes are globally coherent among themselves. And exactly what way those reflexes are globally coherent (there are many possibilities) fixes what the agent cares about terminally tending toward.

Average-Case Monotonicity in Resources is the Theoretical Centerpiece Here, Not Adversarial Inexploitability

The central benefit to being a consequentialist isn’t inexploitability-in-the-adversarial-case. Agents in our world didn’t evolve in an overwhelmingly worst-case adversarial environment. Rather, the central benefit is in becoming more efficient in the average case. You waste fewer calories when you’re less spazzy. You can then run for longer when dinner’s on the line. You’re strictly better at hunting than your less concerted counterparts. Selection smiles on you!

(Later on, after you’re already a pretty smart forethinking consequentialist, you can notice this fact explicitly and actively work on becoming ever more consequentialist.)

Like Taking Candy from a Baby

To return to rocks not having simple utility functions, only trivial ones, notice that putting a hundred-dollar bill on top of a rock does not get the rock closer to its goals. Hundred-dollar bills are generally useful resources to essentially any agent. But the rock fails to take advantage; it has no reflexes that are sensitive to hundred-dollar bills, or to any other nearby obviously helpful resources.

You can just take hundred-dollar bills that are sitting atop a rock. The rock isn’t fighting in the ordinary agentic way (to, say, become as much of a rock as possible), and the rock isn’t consistent in how it grabs and lets go of resources. If the rock has a consistent valuation over lotteries-over-objects, the objects it’s centrally valuing are the steps in the dance of being an ordinary rock. If we weight our hypotheses in favor of those that are more behaviorally coherent with regard to nearby resources, we see that it would be quite the odd agent that behaves just like an ordinary rock. Instead, the objects that a rock cares about are patterns that have to be individuated very strangely (e.g., “the steps in the dance of being an ordinary rock”), and this is nature’s tip-off that the consequentialist frame isn’t very applicable here.

Significance: The Steelman of Bureaucracy, and its Inverse

Jaan Tallinn’s steelman of bureaucracy is an argument for bureaucratic norms, to promote coordination:

in my view deontology is more about coordination than optimisation: deontological agents are more trustworthy, as they’re much easier to reason about (in the same way how functional/​declarative code is easier to reason about than imperative code). hence my steelman of bureaucracies (as well as social norms): humans just (correctly) prefer their fellow optimisers (including non-human optimisers) to be deontological for trust/​coordination reasons, and are happy to pay the resulting competence tax.

Stereotypical bureaucracies (the canonical example being the DMV), the argument goes, are awful places to have to visit because their bureaucrats are blankfaced paper-pushers who always mindlessly implement the rulebook. Given an opportunity for a bureaucrat to slightly bend the rules to better fulfill the intentions behind the rulebook, stereotypical bureaucrats stubbornly refuse to do so. On the one hand, the DMV will not help you out unless you return with the blue official form—you filled out the pink version of the same form. On the other hand, that the DMV bureaucrats won’t take any guerilla actions (they’ll never decide to “move fast and break things”) also makes them more predictable to outsiders, and so easier to reliably plan around.

Suppose that you had a scary competent DMV instead. They do all the interfacing with the other government entities for you, quickly, asking for just the bare minimum of needed information in exchange. They don’t worry too much about dotting their i’s and crossing their t’s—they self-consciously move fast and break things. These guerilla bureaucrats all have consonant reflexes—no weird blind spots—where classic bureaucrats have deontological blocks in place. There are things the classic bureaucrats don’t do. The guerilla bureaucrats have reflexes that add up to keeping the goodness inside the org, with no reflexes breaking from that overall trend.

There’s something odd about going in expecting these guerilla bureaucrats to let their system fall apart because an incompetent higher-up was appointed. They’re no strangers to interfacing with and routing around difficult, large governmental bodies. They’re used to saving and spending political and social capital, and they know who’s competent, where. If they’re all experts in accumulating and spending all of these disparate resources, why would you a priori expect them to fail outright at similarly managing the incompetent-manager resource? More probably, you’d a priori guess that the incompetent manager will be quickly driven out or otherwise sidelined by effectively coordinating underlings.

A track record of juggling the important patterns in the world so as to get the job done is the hallmark of consequentialists. The stereotypical bureaucracy can be corrigible to an incompetent higher-up. They do what they’re ordered to, and don’t improvise their way out from under difficult internal political situations. The guerilla bureaucracy does improvise, and can be reasonably expected to apply those same skills to saving their predictably crashing organization. If you’re the incompetent higher-up, though, this looks like a deceptively aligned organization just waiting for its chance to oust you. That’s why this consequentialism stuff matters, if you’re thinking about training powerful consequentialist AGI.

Conclusion

We looked at the classic reason people haven’t taken VNM seriously—trivial utility functions—and then talked about why those aren’t a major theoretical worry. Namely, trivial utility functions are like pathological Bayesian priors: they’re possible, but they’re also complicated and ugly, which is the universe telling you that something’s wrong here.

We then presented consequentialism through an analogy to the thermodynamic definition of life, highlighting the general notion of “patterns in the world” rather than negentropy specifically. We said that consequentialists have principled Cartesian boundaries and noted that our universe is not indifferent towards which patterns in it you accumulate. Consequentialists are those entities that keep the good stuff inside.

Putting these two pieces together, objects well-modeled as consequentialists are behaviorally consistent with regard to simple patterns and nearby resources. Objects poorly modeled as consequentialists can only be thought of as being consistent with regard to strangely individuated objects, not to simple patterns or nearby resources.

Finally, we looked at the significance of all this: how being good at monotonically accumulating the patterns-that-matter means you’ll be less prone to letting an “incompetent manager” resource burn down your system. If you’re training powerful consequentialists, you shouldn’t expect them to be conveniently inconsistent with respect to incompetent managers in particular. And when the incompetent manager is you, this is cause for worry.

  1. ^

    Namely, for the crucial idea (iirc) that powerful language models are behaviorally coherent with respect to their training-relevant patterns, rather than the patterns that we more centrally think of when we think of agents.

  2. ^

    I’ve left “simplicity” underexplained here, and just claimed that it’s okay to have a “simplicity-weighted prior.” Don’t worry, we’ll return to this!

    In short, simplicity will mean being behaviorally coherent with respect to simple patterns (rather than strangely individuated, ad hoc patterns) and with respect to “resources,” the patterns-that-matter.

  3. ^

    Compare: Daniel Kokotajlo frames agents as “vortices of resources”:

    The important thing for understanding agency is understanding that the world contains various self-sustaining chain reactions that feed off instrumental resources like data, money, political power, etc. and spread to acquire more such resources.

  4. ^

    H/​t Scott Viteri

  5. ^

    It’s not important to our analogy, but the evolutionary reason that reliably accumulating positive charge inside of a cell (and, equivalently, reliably keeping negative charge outside of a cell) matters is that the resulting energy gradient at the cell boundary could then be tapped for other cell functions. Having a steady source of energy to tap gave cells with a net proton-shuttling membrane a competitive edge, and disadvantaged those with discoordinated membranes.