Types and Degrees of Alignment


What would it mean to solve the alignment problem sufficiently to avoid catastrophe? What do people even mean when they talk about alignment?

The term is not used consistently. What would we want or need it to mean? How difficult and expensive will it be to figure out alignment of different types, with different levels of reliability? To implement and maintain that alignment in a given AGI system, including its copies and successors?

The only existing commonly used terminology whose typical uses are plausibly consistent is the contrast between Inner Alignment (alignment of what the AGI inherently wants) and Outer Alignment (alignment of what the AGI provides as output). It is not clear this distinction is net useful.

An alignment failure or misalignment (being misaligned) can mean among other things:

  1. That the system has unintended goals or behaviors.

  2. That the system has unintended (harmful or dangerous) goals or behaviors.

  3. That the system will exhibit undesired goals or behaviors under at least some conditions, in response to at least some inputs:

    1. With respect to the intended goals of the alignment efforts made.

    2. From the perspective of those aligning the system.

    3. From the perspective of a given user, or users in general.

    4. From the perspective of a society, planet, government or humanity.

    5. From some other perspective, cause, system of judgment or value.

  4. That the system exhibits behaviors whose outcomes we would not desire on reflection, including those that lead to poor outcomes, whether or not they match the rules and principles to which we attempted to align the system.

  5. That the system exhibits particular alignment failure modes, such as instrumental convergence and power seeking, lack of corrigibility (resisting being shut down), refusing to obey ‘lawful’ orders, deception or manipulation, attempting to kill all humans and so on.

It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):

  1. Do what I say

  2. Also do what I mean

  3. Also do what I should have said and meant

  4. Also do what is best for me

  5. Also do what broader society or humanity says

  6. Also do what broader society or humanity means or should have said

  7. Also do what broader society or humanity should have said given their values

  8. Also do what is best for everyone

  9. Do some ideal friendly combination of all of it that a broadly good guy would do, in a way that is respectful of and preserves what is valuable on all levels

  10. Strictly follow some other set of rules that were set up long ago, no matter the cost

(And do it all according to a variety of contradictory human heuristics and biases, while looking friendly, while engaging in unnatural behaviors like corrigibility, and not tricking people into giving you requests you can fulfil, and so on, please.)

We must pick at most one of those, or another variation on them, or something else, as our primary target. A machine cannot serve two masters any more than a man can, and the ability to put the machine under arbitrary stress, and its additional capabilities and intelligence, make this that much more clear. Even individually, many of the requests and desired behavioral sets above are not actually logically coherent or consistent.

Getting any one of those ten is hard enough. It is a problem we do not know how to solve for systems more intelligent than we are. We do not even know how to robustly solve it for current systems.

To solve alignment and retain control of AGIs and their actions, we will need to:

  1. Be able to get an AGI to do something a human selects at all, rather than something not selected. Be able to retain some form of control over what it does in the future, or set it on a chosen course. At all.

  2. Have this alignment be of the appropriate type for the role and circumstances, and sufficiently strong, robust and reliable to be maintained.

  3. Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.

  4. Have all of this survive rapid unpredictable changes over long periods, or find a way to prevent such changes.

  5. A key crux: We may need to get this right on the first try when we build the first sufficiently powerful system, because the consequences of getting it wrong on the first try would be catastrophic. There is also disagreement over ‘exactly how right’ we would need to get it to avoid catastrophe.

Useful and consistent terminology and taxonomy beyond this are urgently needed.

We could give these ten forms of alignment the following names (by all means please replace with better names, this is hard). Again, this list is not claimed to be complete:

  1. Literal (Personal) Genie: Do exactly what I say.

  2. Minion: Do what I intended for you to do.

  3. Personal: Do what I would want you to do.

  4. Forceful: Be loyal to me, but do what’s best for me, not strictly what I tell you to do or what I want or intended.

  5. Literal Genie: Do whatever you are collectively told.

  6. Public Servant: Carry out the will of the people.

  7. Value: Uphold the values of the people, and do what they imply.

  8. Cincinnatus: Do what needs to be done, whether the people like it or not.

  9. Robin Williams: The Genie from Aladdin. Note he is not strategic.

  10. Arbiter: What is the law?

We do not currently have a known method of creating reliable alignment of any kind for future AGI systems, or a path known to lead to this. How promising various existing proposals or plans are for getting us there is heavily disputed and a common crux.

In addition to the type of alignment, one can talk about various aspects of the strength, reliability, precision and robustness of that alignment, as well as what ways exist to weaken, risk or break that alignment. These and related words are not used consistently.

In very broad terms, combining aspects that can be distinct for ease of discussion, one might speak of things like, in terms of either inner alignment, outer alignment, or a combination of both:

  1. Fragile alignment. This is the type of alignment that we know how to achieve in existing LLM systems. Something like: You do your best to noisily specify with words or examples what preferences you want to put into the AI, including how the AI might act when different preferences are in conflict. Under default or similar circumstances, the AI will probably (or even almost certainly) act in ways broadly compatible with the general vibe and sense of what was requested. If you take it outside of its training distribution, this will often break down, and there will be various hacks, tricks, frames and ‘jailbreaks’ available to modify behaviors, with which one can play whack-a-mole to raise difficulty and decrease natural frequency.

  2. Friendly alignment. Cares at least importantly in part, ideally primarily, about humans and the things humans care about, and cares a lot about human values or humanity potentially going extinct, enough to spend resources towards such ends, or at least to spend extra resources to avoid causing such events as side effects, and to not aim for configurations of atoms where we are absent or that we would not find valuable.

  3. Human-level alignment. The AGI cares about at least some humans and human values within the range of roughly the same ways and degrees that typical humans care about other humans and human values. One can speak of quantitative levels of this, and what it would take for such considerations to override or be overridden by other considerations such as instructions given. Under the wrong circumstances it might end up doing almost anything, but it is as tough to get weird failure modes as it would be to get humans to end up in those failure modes, ideally similar to when those humans are thinking relatively clearly.

  4. Strict alignment. The damn thing will actually follow some set of instructions to the letter, subject to its optimization constraints; hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and that there is no known good one yet, due to various alignment difficulties.

  5. Strawberry alignment. MIRI calls a well-constrained version of strict alignment ‘strawberry alignment,’ where you can tell the AI to build two strawberries that are identical on the cellular level, and it will do so without causing anything weird or disruptive to happen.

  6. Robust alignment. Something that is reliably going to act in ways that lead to valuable-according-to-[humans or human values] configurations of atoms, or does its best to preserve some invariants that ensure value is preserved, or something like that, using methods which we would approve on reflection, in a way that survives moving far out of the training distribution, and which is secure against disruptions.

All these targets have problems beyond ‘we don’t know how to get this’ (for all but the first one), ‘do we know what the components mean or how to specify them,’ ‘we don’t understand human values’ and ‘is this even a coherent concept.’ For example:

It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important they not break down.

It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, and the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?

It is not clear to what extent strict alignment or strawberry alignment gives us affordance to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.

It is not clear to what extent robust alignment is a coherent concept, especially in a competitive world, or even how it interacts with maximization, as it contains many potential contradictions and requirements. Nor is it clear how one could get or even specify this level of alignment, even under ideal conditions.

A better and more complete future version of this document would include a better taxonomy here similar to the one above.

A key crux is the type and degree of alignment necessary to avoid catastrophe and achieve good outcomes. Another is how difficult such alignment will be to achieve with what level of reliability, and which particular obstacles we need to worry about.

A Missing Additional Post: Alignment Difficulties

My post on the progression through various stages of AGI development handwaved ‘alignment’ to focus on when we might need how much of it depending on what path we take in terms of what AGIs or potential-AGIs exist under how much human control, including the implications for type and degree of alignment necessary.

This post, on degrees and types of alignment, asks what alignment actually means, what forms and degrees it might take, which of them would be required to survive various scenarios, how robustly they would need to be spread and preserved, and so on. Are these types of alignment even possible in theory, or coherent, logically consistent concepts? If you get the thing you ostensibly want, what would happen?

A third post might ask, how likely would it be, and how hard would it be, for us to achieve a given form or degree of alignment, in systems smarter and more capable than us or any previously existing system?

Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, where any flaws are exploited, amplified and fatal, and where there are many lethal problems, all of which one must avoid?

How much hope or doom lies in various potential approaches? Would scaled up versions of things that work on non-intelligent systems likely work out of the box or with ordinary reasonable adjustments, or do we know reasons they definitely fail? Can we use incrementally smarter AIs to solve our problems for us? Will the results naturally be robust, have nice properties, be nicely self-maintaining? Does it fall out of this ‘one weird trick’?

How much investment of time and money, how much sacrifice of capability including continuously, is required to get what we need to make a real attempt? To what extent do we need to ‘get it right on the first try’ due to failure not being something we can recover from, and how much does that increase the difficulty level versus problems where we can iterate?

The most lethal-looking, hard-to-avoid, unnatural-to-solve problems include instrumental convergence, power seeking and corrigibility, yet the list of even central ones is very long – see Yudkowsky 2022, A List of Lethalities.

Creating a neutral-perspective version of such a list, especially an exhaustive one, and getting all the implied cruxes including potential solutions into the crux list, would likely be valuable, especially if it were combined with the cruxes resulting from potential solutions and paths to those solutions, and so on.

Unfortunately, for now, the scope of that project is intractable. I have run out of time if not space, and leave expanding this out to others or to the future. If the answers here matter to you, it will be a long slog of evaluating many complexities. I urge you not to outsource or abstract it, to avoid falling back on social cognition or normality heuristics or grabbing onto metaphors, and instead to think hard about the concrete details and logical arguments.