Abstracting The Hardness of Alignment: Unbounded Atomic Optimization

This post is part of the work done at Conjecture.

Disagree to Agree

(Practically-A-Book Review: Yudkowsky Contra Ngo On Agents, Scott Alexander, 2022)

This is a weird dialogue to start with. It grants so many assumptions about the risk of future AI that most of you probably think both participants are crazy.

(Personal Communication about a conversation with Evan Hubinger, John Wentworth, 2022)

We’d definitely rank proposals very differently, within the “good” ones, but we both thought we’d basically agree on the divide between “any hope at all” and “no hope at all”. The question dividing the “any hope at all” proposals from the “no hope at all” is something like… does this proposal have any theory of change? Any actual model of how it will stop humanity from being wiped out by AI? Or is it just sort of… vaguely mood-affiliating with alignment?

If there’s one thing alignment researchers excel at, it’s disagreeing with each other.

I dislike the term pre-paradigmatic, but even I must admit that it captures one obvious feature of the alignment field: the constant debates about the what and the how and the value of different attempts. Recently, we even had a whole sequence of debates, and since I first wrote this post Nate shared his take on why he can’t see any current work in the field actually tackling the problem. More generally, the culture of disagreement and debate and criticism is obvious to anyone reading the AF.

Yet Scott Alexander has a point: behind all these disagreements lies so much agreement! Not only in discriminating the “any hope at all” proposals from the “no hope at all”, as in John’s quote above; agreement also manifests itself in the common components of the different research traditions, for example in their favorite scenarios. When I look at Eliezer’s FOOM, at Paul’s What failure looks like, at Critch’s RAAPs, and at Evan’s Homogeneous takeoffs, the differences and incompatibilities jump out at me; yet they still all point in the same general direction. So much so that one can wonder if a significant part of the problem lies outside of the fine details of these debates.

In this post, I start from this hunch — deep commonalities — and craft an abstraction that highlights it: unbounded atomic[1] optimization (abbreviated UAO and pronounced wow). That is, alignment as the problem of dealing with impact on the world (optimization) that is both of unknown magnitude (unbounded) and non-interruptible (atomic). As any model, it is necessarily mistaken in some way; I nonetheless believe it to be a productive mistake, because it reveals both what we can do without the details and what these details give us when they’re filled in. As such, UAO strikes me as a great tool for epistemological vigilance.

I first present UAO in more detail; then I show its use as a mental tool by giving four applications:

  • (Convergence of AI Risk) UAO makes clear that the worries about AI Risk don’t come from one particular form of technology or scenario, but from a general principle which we’re pushing towards in a myriad of convergent ways.

  • (Exploration of Conditions for AI Risk) UAO is only a mechanism; but its abstraction makes it helpful to study what conditions about the world and how we apply optimization lead to AI Risk.

  • (Operationalization Pluralism) UAO, as an abstraction of the problem, admits many distinct operationalizations. It’s thus a great basis on which to build operationalization pluralism.

  • (Distinguishing AI Alignment) Last but not least, UAO answers Alex Flint’s question about the difference between aligning AIs and aligning other entities (like a society).

Thanks to TJ, Alex Flint, John Wentworth, Connor Leahy, Kyle McDonell, Laria Reynolds, Raymond Arnold, Steve Byrnes, Rohin Shah, Evan Hubinger, James Lucassen, Rob Miles, Jamie Bernardi, Lucas Teixeira, and Andrea Motti for discussions on these ideas and comments on drafts.

Pinning UAO down

Let’s first define this abstraction. Unbounded Atomic Optimization, as the name subtly hints, is made of three parts:

  • (Optimization) Pushing the world towards a given set of states

  • (Unboundedness) Finite yet without a known limit

  • (Atomicity) Uninterruptible, happens “as in one step”

Optimization: making the world go your way

Optimization seems to forever elude full deconfusion, but an adaptation of Alex Flint’s proposal will do here: optimization is pushing the world into a set of states.[2] Note that I’m not referring to computational optimization in the sense of a search algorithm; it is about changing the physical world.

When I’m talking about “amount” of optimization, I’m thinking of an underdefined quantity that captures a notion of how much effort/​work/​force is spent in pushing the world towards the target set of states. Here’s a non-exhaustive list of factors that can increase the amount of optimization needed:

  • (Small target set) Hitting a smaller target requires more effort

  • (Far away target set) If there are large changes from the current state to the target set, it takes more effort to reach it.

  • (Stronger guarantees) If the target set must be reached with high probability, it takes more effort.

  • (Robustness) If the world must be maintained in the target set, it takes more effort

  • (Finer state-space) If the granularity of states is finer (there are more details in the state descriptions), then it might take more effort to reach the set.
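
One hypothetical way to put a number on the "amount" of optimization is to count how unlikely the target set is under a baseline distribution over states. The sketch below assumes a uniform baseline and measures optimization in bits; neither that measure nor the name `optimization_bits` comes from the proposal discussed above, and it only illustrates two of the factors in the list (small target set, finer state-space).

```python
import math

def optimization_bits(target_size: int, total_states: int) -> float:
    """Bits needed to hit the target set under a uniform baseline.

    Assumption (not from the post): without any optimization, the world lands
    in the target set with probability target_size / total_states.
    """
    if not 0 < target_size <= total_states:
        raise ValueError("target must be a non-empty subset of the state space")
    return -math.log2(target_size / total_states)

# Small target set: shrinking the target raises the cost.
coarse = optimization_bits(target_size=256, total_states=1024)
narrow = optimization_bits(target_size=1, total_states=1024)

# Finer state-space: the target is still one fully specified state,
# but every old state has been split into 4 more detailed ones.
finer = optimization_bits(target_size=1, total_states=4096)

print(coarse, narrow, finer)
```

Under these made-up numbers the cost goes from 2 bits to 10 bits when the target shrinks, and to 12 bits when the state descriptions get finer. Distance, guarantees and robustness would need a richer model than a single ratio.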

Unboundedness: phase transition in optimization

Humans optimize all the time, as do institutions, animals, economic systems, and many other parts of our world. But however impressive the optimization, it is always severely bounded. We talk about absolute power for a king or an emperor, but none of them managed to avoid death or maintain their will for thousands of years yet (most couldn’t even get their teeth fixed better than paupers).

Classical scenarios of AI risk, on the other hand, stress the unboundedness of the optimization being done. Tiling the whole lightcone with paper clips gives a good example of massive amounts of optimization.

Another example of unbounded optimization common in alignment is manipulation: the AI optimizing for convincing the human of something. We’re decently good at manipulating each other, but there are still quite clear bounds on our abilities to do so (although some critical theorists and anthropologists would argue we underapproximate the bounds in the real world). If the amount of optimization that can be poured into manipulation is arbitrarily large, though, we have no guarantee that any belief or system of beliefs is safe from that pressure.

More generally, unbounded optimization undermines solutions that are meant to deal with only some reasonable range of force/​effort (like buttresses in structural engineering). So it means that no amount of buttresses is enough to keep the cathedral of our ideals from collapsing.

Atomicity: don’t stop me now

In distributed computing, an atomic operation is one that cannot be observed “in the middle” from another process — either it didn’t happen yet, or it’s already finished. Ensuring atomicity plays a crucial role in abstracting the mess of distributed interleavings, loss of messages, and other joys of the cloud.
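
As a minimal sketch of the computing notion, the example below uses threads in a single process as a stand-in for distributed processes (an assumption made to keep it self-contained). The `Ledger` class and its methods are hypothetical names. A transfer is two writes, but because both happen under a lock, an observer only ever sees the state before or after the transfer, never the middle.

```python
import threading

class Ledger:
    """Two balances whose sum should always look constant to observers."""

    def __init__(self, a: int, b: int) -> None:
        self._a, self._b = a, b
        self._lock = threading.Lock()

    def transfer(self, amount: int) -> None:
        # Both writes happen under the lock, so no reader can see the state
        # after the debit but before the credit.
        with self._lock:
            self._a -= amount
            self._b += amount

    def snapshot(self) -> tuple[int, int]:
        # Readers take the same lock, so they only see completed transfers.
        with self._lock:
            return self._a, self._b

ledger = Ledger(100, 0)
observed_totals = set()

def writer() -> None:
    for _ in range(1000):
        ledger.transfer(1)
        ledger.transfer(-1)

def reader() -> None:
    for _ in range(1000):
        a, b = ledger.snapshot()
        observed_totals.add(a + b)

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(observed_totals)
```

The reader never gets a chance to observe, let alone act on, a half-finished transfer, which is the property the word "atomic" is borrowed for.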

I use atomic analogously to mean “uninterruptible in practice”. It might be physically possible to interrupt it, but that would require enormous amounts of resources or solving hard problems like coordination.

In alignment, we’re worried about atomic optimization: the optimization of the world which we can’t interrupt or stop until it finishes.

What does this look like? FOOM works perfectly as an initial example: it instantiates atomicity through exponential growth and speed difference, so you can’t stop the AI because it acts both far too smartly and far too quickly. But the whole point of using atomicity instead of FOOM is to allow other implementations. Paul Christiano (What failure looks like), Evan Hubinger (Homogeneity vs heterogeneity in AI takeoff scenarios) and Andrew Critch (What Multipolar Failure Looks Like) all propose different AI Risk scenarios with atomicity without FOOM. Instead of speed, their atomicity comes from the need to solve a global coordination problem in order to stop the optimization. And coordination is just hard.

Application 1: Highlight Convergence of AI Risk Scenarios

In almost any AI Risk story, you can replace the specific means of optimization with UAO, and the scenario still works.

For me, this highlights a crucial aspect of Alignment and AI Risk: it’s never about the specific story. I get endlessly frustrated when I see people who disagree with AI Risk not because they disagree with the actual arguments, but because they can’t imagine something like FOOM ever happening, or judge it too improbable.[3]

The problem with this take is not that FOOM is obviously what’s going to happen with overwhelming probability (I’m quite unconvinced of that), but that it doesn’t matter how UAO is implemented — as long as we have it, we’re in trouble.

And because UAO-based arguments abstract many (all?) of the concrete ones, they are at least as probable as any of them (and probably strictly more probable). Not only that, they even gain from new formulations and scenarios, as these offer additional mechanisms for implementing UAO. So having a variety of takeoff speeds, development models, and scenarios turns from a curse to a boon!
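
The "at least as probable" step is just monotonicity of probability. As a sketch, assume (the notation is introduced here, not taken from the argument above) that each concrete scenario $S_i$ (FOOM, RAAPs, and so on) implies UAO, so that as events $S_i \subseteq U$:

```latex
\[
S_i \subseteq U \ \text{for all } i
\quad\Longrightarrow\quad
P(U) \;\ge\; P\Big(\bigcup_i S_i\Big) \;\ge\; \max_i P(S_i).
\]
```

The last inequality is strict as soon as some scenario has positive probability outside the most probable one, which is why every genuinely new scenario can only add to the weight of the abstract argument.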

What this also entails is that to judge the probability of these risks, we need to assess how probable UAO is, in any implementation.

Convergence to UAO

To start with unboundedness, it follows straightforwardly from technological progress. Humanity is getting better and better at shaping the world according to its whims. You might answer that this leads to many unwanted consequences, but that’s kind of the point, isn’t it? At least no one can say that we don’t have a massive impact on the world!

This is also where AI gets back into the picture: ML and other forms of AI are particularly strong modern ways of applying optimization to the world.[4] And we currently have no idea where it stops. Add to that the induction from past human successes that huge gains can come from insights into how to think about a problem, and you have a recipe for massively unbounded optimization in the future.

As for atomicity, it has traditionally been instantiated through three means in AI Risk arguments:

  • (Computers running faster than humans) Current computers are able to do far more, and faster, than humans if given the right instructions, and Moore’s law and other trends don’t give us good reason to expect the gap to stop growing. This massive advantage incentivizes progress in optimizing the world to go through computers, making the optimization increasingly atomic.[5] More generally, human supervision becomes the bottleneck in any automated setting, and so risks getting removed to improve efficiency if the system looks good enough.

  • (Inscrutability pushed by competitiveness) Ideally, everyone would want to be able to understand completely what their AI does in all situations. This would clearly help provide guarantees for customers and iterate faster. But the reality of ML is that getting even half there is extraordinarily difficult, it costs a lot of time and energy, and you can get amazing results without understanding anything that the model does. So competitiveness conspires to push AGI developers to trade interpretability and understanding for more impressive and marketable capabilities. This gulf between what we can build and what we can understand fuels atomicity, as we don’t have the mental toolkits to check what is happening during the optimization even if it is physically possible.

  • (Coordination failures) Humans are not particularly good at agreeing with each other in high stakes settings. So if the only way to stop the ongoing optimization is an economy-wide decision, or an agreement to not use a certain type of model, we should expect enormous difficulties there. And in a situation (the use of more and more optimization in the world) where free riders can expect to reap literally all the benefits (if they don’t die in the process), it’s even harder to agree to stop an arms race.[6]

The gist is that we’re getting better and better at optimizing, through technology in general and computers and automation in particular. This in turn leads to a more and more atomic use of optimization, due to the high speed of computers and the incentives to automate. With the compounding effect of the difficulty of coordinating, we have an arms race for building more and more atomic optimization power, leading to virtually unbounded atomic optimization.

Application 2: Explore Conditions for AI Risk

While UAO is a crucial ingredient of AI Risk, it is not enough: most scenarios need some constraints on how UAO is applied. The abstraction of UAO lets us then focus on exploring these conditions, to better understand the alignment problem. As such, UAO provides a crucial tool for epistemological vigilance on the assumptions underlying our risk scenarios.

Let’s look at two classes of proxies for an example: overapproximation proxies and utility maximization proxies. These two capture many of the concrete proxies that are used in AI Risk scenarios, and illustrate well how UAO can clarify where to investigate.

The Danger of Overapproximations

Overapproximation proxies point to quite reasonable and non-world-shattering results, like “Make me rich”.

Here are their defining properties:

  • (Looseness) The proxy is one where the target set massively overapproximates the set of states that we actually want. For example, “Make me rich” as operationalized by “Make my bank account show a 10 digits number” contains a lot of states that we don’t really want (including those where I’m dead, or where the Earth is turned into computers that still somehow keep track of my bank account). The set of states I’m thinking of only makes a particularly small subset of the proxy’s set.

  • (Reliability) The proxy asks for reaching the set of states with high probability

  • (Robustness) The proxy asks that the change sticks: the world doesn’t leave the target set after the optimization is done.

Let’s look at what happens when we apply UAO to such proxies. Our proxy gives us a fixed, overapproximated target set of states. Let’s say something like “produce 20 billion paperclips in the United States per year” (about twice the current amount). You don’t need to tile the universe to reach that target at all. So it’s relatively easy to end up in the set of states we’re aiming for. But what about reliability and robustness, the other two requirements of the proxy? Well if you want to guarantee that you’ll reach the target set and not get out of it, one way to do so is to aim for the part of this target set that is more controlled and guaranteed.[7] Like for example, the one where the Earth is restructured for better paperclip-making conditions (without those bothersome humans for example!). As the optimization increases, it is increasingly spent on reliability and robustness, which strongly incentivizes using the many degrees of freedom to guarantee the result and its permanence. Hello instrumental convergence!

The story is thus: unbounded atomic optimization + overapproximate proxies ⇒ incentive for numerous degrees of freedom to be used in systematically bad ways.
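
The mechanism can be sketched with a toy model. Everything below is hypothetical: the plans, their costs and their reliabilities are made-up numbers, and "pick the most reliable affordable plan" is only one simple stand-in for optimization being spent on reliability and robustness. All three plans satisfy the loose proxy; only the first lies in the subset we had in mind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    cost: int            # optimization needed to execute the plan
    reliability: float   # chance of reaching and staying in the proxy's target set
    intended: bool       # is the outcome in the subset we actually wanted?

# Hypothetical plans for "produce 20 billion paperclips per year".
PLANS = [
    Plan("build a few more factories", cost=1, reliability=0.90, intended=True),
    Plan("monopolize the steel supply", cost=10, reliability=0.99, intended=False),
    Plan("restructure Earth for paperclip-making", cost=1000,
         reliability=0.999999, intended=False),
]

def best_plan(budget: int) -> Plan:
    """Pick the most reliable plan among those affordable at this budget."""
    affordable = [p for p in PLANS if p.cost <= budget]
    return max(affordable, key=lambda p: p.reliability)

chosen = {budget: best_plan(budget) for budget in (1, 10, 1000)}
for budget, plan in chosen.items():
    print(budget, plan.name, plan.intended)
```

With a budget of 1 the only affordable plan is the intended one. As the budget grows, the extra optimization buys reliability, and the selected plan leaves the intended subset while still satisfying the proxy.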

Note that if we want to avoid this fate, our abstract conditions give us multiple points of intervention:

  • We want to change the proxy to have better properties.

  • We might question whether realistic proxies have these properties.

  • We might want to fill in details of how the UAO is used, to break the argument somewhere in the middle.

  • We might want to remove or simplify some of the constraints on the proxy, to see if we can strengthen the argument by weakening its hypotheses.

  • We might want to question the actual strength of the incentives, to break the argument in the abstract.

Terrible Returns on Utility

Utility maximization proxies are specified by the maximal states according to some utility function. It should come as no surprise to readers of this post that maximizing utility can lead to terrible outcomes. The question is: what is needed for that to happen?

This part shows more how UAO can lead to asking relevant questions. My current best guess is that we also need two conditions on the proxy:

  • (Beyond the goal) The utility function is such that the actual states of the world that we want, the ones that we’re visualizing when coming up with the utility function, have far less than maximal utility. So when I ask to maximize money, the amounts I have in mind are far smaller than the ones that can be reached in the limit (by converting all the universe into computronium and thus encoding a massive number for example).

  • (Terrible upper set) There is a threshold of utility that is physically reachable and such that every state with at least that much utility is terrible for us.

With these two conditions, it follows that UAO will push us into the terrible upper set, and lead to catastrophic AI Risk.
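
One possible formalization of this step, with notation that is introduced here and is not part of the argument above: let $S$ be the finite set of states, $u$ the utility function, $W \subseteq S$ the states we had in mind, $T \subseteq S$ the terrible states, and $R(b) \subseteq S$ the states reachable with an amount $b$ of optimization. The strict inequality in the first line is a weak stand-in for "far less than maximal".

```latex
\[
\begin{aligned}
&\text{(Beyond the goal)} && \max_{s \in W} u(s) < \max_{s \in S} u(s),\\
&\text{(Terrible upper set)} && \exists\, \theta \ \text{reachable}:\quad
  \{\, s \in S : u(s) \ge \theta \,\} \subseteq T,\\
&\text{(Conclusion)} && R(b) \cap \{\, s : u(s) \ge \theta \,\} \neq \emptyset
  \ \Longrightarrow\ \operatorname*{arg\,max}_{s \in R(b)} u(s) \subseteq T.
\end{aligned}
\]
```

The conclusion holds because any maximizer over $R(b)$ has utility at least that of the reachable state above the threshold, hence at least $\theta$, hence lies in $T$. Unboundedness is what makes the premise of the last line plausible: no known limit on $b$ keeps $R(b)$ below the threshold.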

The interesting bit here lies in analyzing these conditions for actual utility functions, like “maximizing paperclips”. And just like with the overapproximation proxies, multiple points of intervention emerge from this analysis:

  • We want to change the proxy to have better properties.

  • We might question whether realistic proxies have these properties.

  • We might want to fill in details of how the UAO is used, to break the argument somewhere in the middle.

  • We might want to remove or simplify some of the constraints on the proxy, to see if we can strengthen the argument by weakening its hypotheses.

Application 3: Anchor Operationalization Pluralism

In my last post, I discussed different levels at which pluralism might be applied and justified. The one that UAO is relevant to in my opinion is operationalization pluralism, or pursuing multiple operationalizations (frames/​perspectives/​ways of filling the details) for the same problem.

The tricky part in operationalization pluralism is to capture the problem abstractly enough to allow multiple operationalizations, without losing the important aspects of the problem.

UAO provides one candidate abstraction for the alignment problem.

In some sense, UAO acts as a fountain of knowledge: it rederives known operationalizations when you fill in the implementation details or make additional assumptions. As such, it serves both as a concrete map and as a tool to explore the untapped operationalizations. We can pick unused assumptions, and generate the corresponding operationalization of the alignment problem.

Three concrete ways of generating operationalizations are

  • Specifying the implementation details to make the problem more concrete.

  • Staying at the abstract level, but privileging the study of one intervention on how UAO will be applied.

  • Starting with epistemic tools, and operationalizing UAO in the way that is most susceptible to yielding to these tools.

Let’s look at examples of all three in alignment research.

Filling in the blanks: neural nets, brain-like algorithms and seed AI

The obvious way of operationalizing UAO is to make it concrete. This is exactly what Prosaic Alignment, Steve Byrnes’ Brain-like AGI Alignment and some of MIRI’s early work on seed AIs do.

  • Prosaic Alignment assumes that UAO will be instantiated through neural networks trained by gradient descent. It is somewhat agnostic to architecture and additional tricks, as long as these don’t cross a fuzzy boundary around a paradigm shift.[8] Also, despite Paul Christiano’s doubts about FOOM, prosaic alignment doesn’t forbid FOOM-like scenarios.

  • Steve Byrnes’ research assumes that UAO will be instantiated through reimplementing the learning algorithms that the human neocortex uses. It doesn’t really specify how they will be implemented (not necessarily neural nets but might be).

  • MIRI’s early work (for example modal combat and work on Löb’s theorem) assumed that UAO would be instantiated through hand-written AI programs that were just good enough to improve themselves slightly, leading to an intelligence explosion (with a bunch of other assumptions). A running joke was that it would be coded in LISP, but implementation details didn’t really matter, so long as the initial code of the seed was human intelligible and human crafted.

  • Critch’s RAAPs assumes that UAO is instantiated through structure, that is through the economy itself (unbounded atomic capitalism, if you prefer).

These assumptions were historically made from a normative perspective: each researcher believed that this kind of AI was either the most probable, or had a significant enough probability to warrant study and investigation.[9]

But here we’re starting from UAO instead. By making these additional assumptions, each operationalization unlocks new ways of framing and exploring the problem. As an analogy, in programming language theory, the more generic a type, the less you can do with it; and the more specific it becomes, the more methods and functions can be used on it. So if we assume that UAO will be instantiated as neural networks trained by gradient descent, we have more handles for exploring the general problem and investigating mechanisms. A perfect example is the small research tradition around gradient hacking, which looks for very concrete neural network implementations of a certain type of treacherous turn incentivized by instrumental convergence.
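
The type analogy can be made concrete with a small sketch. The function names are hypothetical, and Python only checks these annotations through an external type checker such as mypy, not at runtime. With a fully generic element type `T`, a function can do little more than move values around; once the element type is known to be `str`, string-specific methods become available.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

def first_generic(items: Sequence[T]) -> T:
    # Nothing is known about T, so the only safe thing to do is hand a value back.
    # A type checker would reject something like items[0].upper() here.
    return items[0]

def longest_shouted(words: Sequence[str]) -> str:
    # Knowing the elements are strings unlocks len() comparisons and .upper().
    return max(words, key=len).upper()

generic_result = first_generic(["gradient", "hacking"])
specific_result = longest_shouted(["gradient", "hacking"])
print(generic_result, specific_result)
```

The generic function works on any sequence but learns nothing about its contents. The specific one does more, at the price of only applying to strings, which mirrors the trade made when UAO is assumed to be a neural network trained by gradient descent.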

Yet there are also risks involved in such an instantiation. First, if the instance is a far simpler case than the ones we will have to deal with, this is an argument against the relevance of solving that instance. And more insidiously, what can look like an instantiation might just pose a completely different problem. That’s one failure mode when people try to anchor alignment in ML and end up solving purely bounded optimization problems without any theory of change about the influence on unbounded atomic optimization.[10]

Working directly on the abstraction

Another category of operationalizations stays at the abstract level, and focuses instead on one possible intervention on UAO as the royal road to alignment. A lot of the work published on the AF fits this category, including almost all deconfusion.[11] Among others, there are:

  • John Wentworth’s work on Abstraction and the Natural Abstraction Hypothesis, which focuses on finding True Names for human values, in order to not have proxies but the real deal.

  • Quintin Pope, Alex Turner, Charles Foster, and Logan Smith’s work on shard theory, which focuses on a structure of counterpowers that allows more optimization to be spent without the classical failure modes.

  • Stuart Armstrong’s work on Model-Splintering, which focuses on how to extend values when more optimization leads to shifts in ontology.

The tricky part is that so much of the work at this level looks like fundamental science: it’s about exploring the problem almost as a natural object, in the way computer scientists would study a complexity class and its complete problems. In the best cases, this level of abstraction can yield its secrets to simple and powerful ideas, like “high-level summary statistics at a distance” or “counting options through permutations”. But even then, drawing conclusions for the solution of the problem is hard, and requires epistemological vigilance.

That being said, such work still plays a crucial role in alignment research, and we definitely need more of it. Even when working from within an instantiation like prosaic alignment, it’s often fruitful to move between this level and the more concrete. I conjecture that it comes both from the purity of the models used (which leads to focus on nice math) and from removing the details that obscure or hide the core of unbounded atomic optimization.

Privileging particular tools

The last category in my non-exhaustive list consists of those operationalizations which start from their methods and the veins of evidence where they go searching for hidden bits.

  • Vanessa Kosoy’s work is the most obvious example to me, with a focus on extracting knowledge and solutions through computational learning theory. This also comes with some instantiation assumptions (that the AI is a Bayesian RL model), but those are significantly less concrete and constrained than in prosaic alignment, for example.

  • Andrew Critch’s RAAPs looks like a framing of alignment and UAO fitted to the analysis of structures, from sociology to computational social choice.

  • Steve Byrnes’s work also fits as a research programme driven by neuroscience. I don’t know if he would agree with this characterization, but it still looks like a fruitful framing to me.

  • A tradition of mostly MIRI work but also some CLR and some independent research focuses on decision theory and how it can clarify issues around alignment and the consequences of UAO.

Here the risk is to pick an irrelevant field, or one with only superficial links to alignment. I think it’s possible to analyze the expected productivity of an analogy, for example based on the successes in that field. Also relevant, if the field in question doesn’t have many successes, is whether the analogy reduces alignment to a currently really hard problem (like P vs NP), or to some simpler problem that these other fields have a reasonable chance to tackle.

My attitude to this category of operationalization is that we should look for even more opportunities and bring as many analogies as we can, as long as we expect them to become productive for alignment. The PIBBSS Fellowship is pushing in that direction, and I expect a clearer framing of the constraints to help.

Application 4: Separate AI Alignment From Other Forms of Alignment

As a final application of UAO, let’s separate alignment of AIs from other forms of alignment.

Here I want to turn to Alex Flint’s nice analysis of Alignment vs AI Alignment,[12] where he attempts to separate aligning AI from alignment of other systems like oneself or society. Concretely, his non-AI examples are:

  • Aligning a society through property rights

  • Aligning a society through laws

  • Aligning one’s own cognition through habit formation

  • Aligning a company via policy and incentives

  • Aligning animals through selective breeding and domestication in general

Alex then asks what separates aligning an AI from all these examples.

My answer: the combination of unboundedness and atomicity in the optimization. In all these examples, unbounded optimization applied atomically is irrelevant. In principle each example can be optimized somewhat unboundedly, but it happens so slowly that we can iterate — an assumption requiring epistemological vigilance in alignment.

Or said differently, it’s unbounded optimization but applied little by little, with time to change course in between. Just like cathedral builders could see cracks and failure happening over the course of decades and correct them.

Note that this doesn’t mean these fields can’t help with alignment. Just that alignment is qualitatively different from the phenomena traditionally tackled by economics, behavior change, and these other fields. This difference must be kept in mind when building a theory of change for applying insights from these other disciplines.

UAO, a Productive Mistake

We’ve seen that unbounded atomic optimization serves in multiple applications:

  • It highlights the convergence of multiple concrete instantiations of AI Risk by abstracting them all.

  • It helps in formulating and exploring conditions for AI Risk.

  • It gives a framing for operationalization pluralism in alignment.

  • It separates AI alignment from alignment of other systems.

This makes me think that UAO is a productive mistake.

How is it a mistake? That is, what does it hide away or distort? Mostly it assumes the hardness of the problem. Some people believe that alignment is significantly easier than dealing with UAO: maybe the increases in optimization between iterations of AIs will be slow enough to adapt and break atomicity, for example. I’m personally dubious of such simplifications, as they look more like wishful thinking than arguments to me. But UAO is definitely colored by my takes, and my general stance towards epistemological vigilance.

Still, UAO can act as a characterization of the hard alignment problem that is more conducive to debates about the difficulty of alignment and the assumptions we can get away with.

  1. ^

    Here the word “atomic” refers to the etymological meaning “indivisible”, rather than the common usage “small”

  2. ^

    This setting can deal with utility functions by focusing on the set of states with maximal utility (which exists because there are finitely many states).

  3. ^

    How do I know that they might agree with the actual argument? Because most often, when I then present them a more structural implementation of UAO like Critch’s RAAPs, they end up agreeing with the risks!

  4. ^

    Here again, it’s important to note that I’m using optimization in the “physically changing the world” sense, not in the computational “internal search” sense. So what AI gives us here is the ability to “internally search” for better ways of acting in the world, and this whole process fits under what I call optimization.

  5. ^

    This is where the atomicity comes from in fast takeoffs and FOOM-like scenarios.

  6. ^

    Exploring these structural factors is the big contribution of Critch’s RAAPs in my opinion.

  7. ^

    This is but another way of framing Bostrom’s insightful point about how even a wireheading AI would have reasons to tile the universe to protect itself and its wireheading.

  8. ^

    Important to note that this subclass of alignment is comparatively far larger (at least in terms of active research) than the other two, and has additional specializations (for example whether the NN will be trained by RL or self-supervised learning).

  9. ^

    Critch feels like a strong exception, because I interpret his introduction of RAAPs as an attempt to add structural perspective to alignment to round off the field. And although Paul believes in the normative claim that the first AGI will probably be prosaic, he does argue that even if that’s not the case, we should expect a solution to prosaic alignment to translate to the other version and capture some hard parts of the problem. And when I asked him the question, he told me that what mattered was to make the problem well-defined.

  10. ^

    See this post for an exploration of the common assumptions that need to be questioned in alignment to not fall into this trap.

  11. ^

    Some exceptions are Evan Hubinger et al.’s inner optimization and Paul Christiano’s universality, which are tailored for prosaic alignment. Yet they end up being useful for other approaches too.

  12. ^

    Discussions with Alex while he was writing that post ultimately led me to realizing the need for the atomicity condition, so he gets the credit for that!