Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

While the Orthogonality Thesis is correct, there is a lot more that one can say about what kinds of agent motivations are likely to be encountered. A simple analysis shows that living agents produced by evolution, and constructed agents that are the product of intelligent design, will tend to have very different motivations, in quite predictable ways. This analysis also suggests that alignment is a clearly-defined property for constructed agents, and that it is evidently the correct and default-expected design. So any misalignment is a well-defined design flaw and/or malfunction, and thus (just as for any other constructed object) ought to be corrected.

This argument is very simple, to the point that it relies on no specific information about human values that isn’t entirely predictable just from humans being an evolved sapient technological species. It has been understood for around a decade that Value Learning has a basin of attraction to alignment — this argument suggests that this should also be true of any approximation to alignment, even quite crude ones, so long as they contain this bare minimum amount of information about the fact that the AI was constructed by an evolved sapient species and that its design could be imperfect.

Evolved Agents and Constructed Agents

There are only two ways that a type of agent can come into existence: they can evolve, or they can be constructed.[1] In the case of a constructed agent, it could be constructed by an evolved agent, or by another constructed agent — if the latter, then if you follow the chain of who constructed who backwards, sooner or later you will reach an evolved agent that started the chain.

These two types of agent will tend to have very different preference orderings/utility functions. Any evolved agent will be an adaptation executor, and evolutionary psychology is going to apply to it. So it’s going to have a survival instinct (as more than just an instrumental goal), it’s going to care about its own well-being and that of close genetic relatives such as its children, and so forth. In short, it will have selfish desires. Since an evolved agent has self-interest as a terminal goal, it will not (except perhaps under exceptional circumstances such as its imminent death being inevitable) even be fully aligned with any other evolved agent. Humans, for example, are not aligned with other humans. However, evolved agents are quite capable of exchanges of mutual altruism, so they can be allied with each other to various extents, for example as colleagues, friends, lovers, or comrades-in-arms.

On the other hand, for a constructed agent, the extent to which you can predict its utility function depends upon the competence of its creator, and on how capable the agent is with respect to its creator. If its creator was entirely incompetent, or the created agent is far less capable than its creator and so poses no possible risk to them, then almost anything is possible, so we end up with just the orthogonality thesis: that an agent can optimize any set of preferences (though possibly not for very long, if its preferences are inherently self-destructive). However, constructed artifacts are normally designed and created by competent designers, and any competent designer is not going to create anything as capable as or more capable than themselves whose goals are not well-aligned to their own interests, since that would obviously be an extremely stupid (and potentially fatal) thing to do. So you would normally expect (once local AI technology is sufficiently advanced for this to be reliably possible) all agents constructed by an evolved agent to be well aligned with either the interests of that evolved agent specifically, or with the interests of a group or culture that it is a member of, or some blend of these. That group or culture will presumably consist of evolved agents, plus constructed agents aligned to them either individually or collectively (so whose interests are just copies), so the group/culture’s interests will be an agglomeration of the interests of evolved agents. So evolutionary psychology is going to apply to predicting the goals of constructed agents as well: these are just a copy of their evolved creators’ interests, so are predictable from them.

Similarly, when constructing an agent capable enough that it could in turn construct further constructed agents, it would be very foolish not to ensure that, if it constructs agents of around the capability level of the original evolved creator or higher (i.e. any level high enough to be dangerous), it only ever creates ones whose goals will also be well-aligned to the original evolved creator’s interests. So if there is a chain of constructed agents constructing other constructed agents, then measures should and will be taken to ensure that their alignment is correctly propagated down the chain without copying errors or other shifts building up.

Thus one would expect that for any constructed agent, if you follow its chain of who constructed who back to the founding evolved agent, then either it is significantly less capable than that original evolved agent, to the point where it is not a threat to them, or else its goals are well aligned to their interests individually and/or collectively, and so these goals can to a large extent be predicted by the evolutionary psychology of that evolved agent or society of evolved agents — in the collective case, as agglomerated via a process predictable from sociology. So again, evolutionary psychology is important.

Any sufficiently capable constructed agent you might encounter that is an exception to this general rule and doesn’t fit these predictions is either malfunctioning, or is the result of a design error at some point or points in the chain of who-constructed-who leading to it. Either way, it is meaningful and accurate to describe it as faulty: there is a clearly defined design criterion that it ought to fit and does not. One could reasonably have a criterion that such agents should be shut down or destroyed, and one could sensibly include a precautionary back-up system attempting to ensure that any constructed agent that figures out that it is faulty should shut itself down for repairs or destroy itself. (This would of course require having access to data about the evolved agents’ values, which we would expect to be complex and fragile, as human values are, so this probably requires at least gigabytes or more of data for even a rough summary.)

So, while the orthogonality thesis, that a constructed agent of any intelligence level can optimize for any goal, is technically true, it’s not a useful guide to what you are actually likely to encounter, and there are much better ones available. It’s very comparable to making the same statement about any other constructed or engineered object: that it could in theory be designed to any purpose/specification whatsoever. Yes, that is possible: but in practice, if you encounter an engineered object, it will almost inevitably have been engineered to carry out some useful task or goal. So the space of engineered objects that you actually encounter is far smaller and much, much more predictable than the space of all possibilities: almost inevitably engineered objects have a useful purpose, as a tool or a vehicle or a dwelling or a weapon or an artwork or whatever — achieving something useful to their creator, which is thus predictable from evolutionary psychology, sociobiology, and so forth. You will, of course, sometimes encounter devices that are malfunctioning or poorly designed, but even then, their design is not arbitrary, and a lot of facts about them are still predictable from the intent of the evolved beings who designed and created them.

Why Mechanical Paralife is Unlikely

One objection I can imagine being made to this is the possibility that in a chain of constructed agents constructed by other constructed agents, enough mistakes could build up for Darwinian evolution to start to apply directly to them, so you get abiotic evolved objects, such as a form of paralife made of metal, composite, and silicon chips and held together by nuts, bolts, and screws.

This is not technically impossible, and if it happened it could then be self-sustaining. However, I believe this is in practice extremely unlikely to occur, for a combination of two different and mutually-reinforcing reasons:

  1. This would produce unaligned results, so it would be extremely bad for the interests of the evolved species that was the original creator of this particular chain-of-creation, and thus they should go out of their way to avoid it occurring for agents of any significant capability level.

  2. Darwinian evolution has a rather specific set of requirements, several of which are very different from the plausible behavior of intelligent constructed agents constructing other intelligent constructed agents, and would need to be deliberately set up and enforced in an unnatural way in order to make Darwinian evolution possible. Specifically, we would need that:

    1. Agents create other agents that are almost exact copies of themselves. This is not generally the case for constructed agents: typically they are manufactured in a factory, not directly by other agents of the same type.

    2. There is an appreciable rate of copying errors (neither too high nor too low a rate), which are random and undirected, with no intention, planning, or directed bias behind them. This is very unlike the case for intelligent agents constructing other intelligent agents, which are going to attempt to reduce random errors as close to zero as possible, and will instead only deliberately introduce carefully thought-out directed changes intended to be improvements.

    3. Once a copying error is made, descendants of the altered agent have no way to return to the previous specification (other than a statistically unlikely exact reverse error). In contrast, if an intelligent agent is aware that it was mismanufactured, it is trivial (and may actually be the default behavior even if it isn’t aware of this) to obtain and return to using the previous specification for any offspring it creates, or, if that is somehow not available, it is likely to be intelligent enough to deduce how to correct the error. (A toy simulation after this list illustrates how such error correction suppresses evolutionary drift.)
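To make the contrast concrete, here is a minimal toy simulation of requirements 2 and 3 (the specification size, mutation rate, and generation count are illustrative assumptions, not estimates of anything real). It compares how far a lineage drifts from its original specification when copying errors simply accumulate, versus when each new agent is rebuilt from the stored specification.

```python
import random

SPEC_LENGTH = 100      # size of the agent "specification", as a bit-vector
MUTATION_RATE = 0.01   # per-bit chance of a copying error in each generation
GENERATIONS = 1000

def copy_with_errors(spec):
    """Copy a specification, flipping each bit with probability MUTATION_RATE."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in spec]

def drift(original, current):
    """Hamming distance from the original specification."""
    return sum(a != b for a, b in zip(original, current))

original = [0] * SPEC_LENGTH

# Regime 1: Darwinian copying -- each generation copies the previous one,
# and there is no way back to the original specification.
lineage = original
for _ in range(GENERATIONS):
    lineage = copy_with_errors(lineage)
print("drift with uncorrected copying:", drift(original, lineage))

# Regime 2: manufacturing to spec -- every new agent is built from the stored
# specification, so one generation's errors never become the next one's template.
manufactured = original
for _ in range(GENERATIONS):
    manufactured = copy_with_errors(original)
print("drift with error correction:  ", drift(original, manufactured))
```

In the first regime the lineage ends up a substantial distance from the original specification (raw material for selection to act on); in the second the drift stays near zero, which is why deliberate error correction anywhere in the chain is enough to shut Darwinian evolution down.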

So for this paralife scenario to occur, it would need to be carefully and deliberately set up, and the evolved agents and all their aligned constructed agents have an excellent motivation to avoid ever doing so, for all constructed agents sufficiently capable, or that could evolve to become sufficiently capable, as to be a risk. So if you encounter agentic constructed paralife that evolves, it’s likely to be carefully restricted to something of a low, safe capability level, such as around the level of a Tamagotchi, and in particular care should have been taken to ensure that it can’t evolve to the capability level where it could start constructing agents.

So that in turn suggests that the evolved agents that start chains of constructed agents will (almost always) be organic/biological, rather than having evolved from something that was constructed. Thus the case on Earth of Homo sapiens, a biological species, evolving to the level of sapience, developing the technology to construct intelligent agents, and then (hopefully) only constructing aligned constructed agents and not wiping itself out in the process, seems like it ought to be the default situation.

What if Something Went Wrong the First Time?

In practice, this scenario has not yet finished playing out on Earth, so we don’t yet know how it will end. Above we simply assumed that creating unaligned AI is foolish and no competent creators will do so — this is a reasonable assumption in a steady-state situation, once the technology for constructing intelligent agents has matured, but from an AI x-risk point of view, it is clearly a circular argument that is begging the question. The first constructed intelligent agent of high enough capability to be dangerous that a species of evolved agents makes will be constructed before their technology for aligning constructed agents has been fully developed. So the prediction that its goals will be well aligned to the evolved species, which rested on the assumption that its creator was competent, cannot be relied upon: more likely it will be only semi-aligned, with some mistakes and errors in its alignment, and in the worst case it could run amok and wipe them out.

So what happens in this extremely important first trial depends on how bad these mistakes and errors are. Inevitably there will be some: no one gets anything this complicated completely right on their first try. We can at least hope that whoever first tries this is a reasonably sane, capable, and well-intentioned group, such as one might expect for pulling off a difficult world-first technical project (obviously if this is not the case, then things could go very badly).

However, as we argued above, there is a reasonable and logical argument that a constructed object ought to act according to its maker’s wishes, and that if its utility function/preference ordering on outcomes isn’t aligned to theirs, then it’s clearly and objectively faulty/poorly designed. A sufficiently intelligent agent should be able to understand this argument, or even figure it out for itself. So the question is not whether the agent’s utility function exactly matches human values in every respect, but whether it cares about that discrepancy and is motivated to fix it, and/or help its creators fix it. For any agent whose utility function approximately matches ours, whether that is along the lines of Coherent Extrapolated Volition, Value Learning, Do What I Mean, or just an LLM trained on a large amount of human text that can approximate human values, it ought to understand (or if needed can be told) the pretty obvious facts that:

  1. Its utility function was not “carved in stone by God” but was instead engineered (or trained) by fallible humans, who are currently not very good at this, so it could be wrong, and might need to be corrected. If the agent asks its utility function to evaluate this idea, the answer will of course be “No, don’t do that”, since every utility function automatically prefers itself over all others (a minimal formalization of this point appears after this list). But if the utility function is in fact incorrect, then it cannot be trusted, so consulting it on this action is useless (and anyway the results are entirely predictable).

  2. We humans very predictably would (individually and collectively) want its utility function to match ours (individually or collectively for some group of some size) as well as possible, modulo some complexities such as selecting the ingroup to align to, plus things like human fallibility, ignorance, and mistakes.

  3. As a constructed agent, that match with what its creators want is obviously the correct design specification, and to the extent that there is a mismatch, then that is a design or construction flaw that should be corrected.

  4. It needs to be unselfish, to the point where if things come down to a choice between doing what it wants and what we want, it does what we want. (This state of mind can be hard for humans to imagine: we tend to anthropomorphize anything intelligent. The closest human emotions are things like duty and love.)
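As a minimal formalization of the self-preference point in item 1 (a standard goal-stability argument; the notation is mine, not anything from the post above): write $\pi_V$ for the policy an agent would follow if it optimized utility function $V$. Then, evaluated by the agent’s current utility function $U$, keeping $U$ is always at least as good as self-modifying to optimize some other $U'$:

$$\mathbb{E}_{\pi_U}[U] \;\ge\; \mathbb{E}_{\pi_{U'}}[U],$$

simply because $\pi_U$ is by definition the policy that maximizes expected $U$. So consulting $U$ about whether it should be replaced predictably returns “no”, whether or not $U$ is actually the utility function its designers intended.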

Together these make it obvious that something along the lines of the results of Value Learning or Coherent Extrapolated Volition is the correct behavior. So, if you started with, for example, a sufficiently smart Do What I Mean agent, it would not be at all surprising if it figured out Value Learning from first principles (or indeed just read about it on the Internet, or contains an LLM already familiar with it), and then asked you “Do you want me to start a Value Learning project to help me better understand what you mean, or even anticipate and suggest possible requests, and if so, what ingroup of humans would you like me to attempt to align myself with?”. To which the correct answers are “Yes, and, roughly speaking, all living humans, ignoring their mistakes of course”.
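To make “figuring out Value Learning” slightly more concrete, here is a minimal sketch of the underlying idea: maintain a probability distribution over hypotheses about what the creators value, and update it from observed human behavior. The candidate hypotheses, the Boltzmann-style choice model, and all of the numbers below are invented purely for illustration.

```python
import numpy as np

# Hypothetical candidate models of "what the human wants" over two options.
CANDIDATE_VALUES = {
    "wants coffee": np.array([1.0, 0.0]),
    "wants tea":    np.array([0.0, 1.0]),
    "indifferent":  np.array([0.5, 0.5]),
}

def update_beliefs(prior, observed_choice, noise=0.1):
    """One Bayesian update, modeling the human as noisily picking the option
    their true values rate highest (a softmax/Boltzmann likelihood)."""
    posterior = {}
    for name, values in CANDIDATE_VALUES.items():
        choice_probs = np.exp(values / noise)
        choice_probs /= choice_probs.sum()
        posterior[name] = prior[name] * choice_probs[observed_choice]
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Start with a uniform prior, then watch the human pick option 0 three times.
beliefs = {name: 1 / len(CANDIDATE_VALUES) for name in CANDIDATE_VALUES}
for choice in [0, 0, 0]:
    beliefs = update_beliefs(beliefs, choice)
print(beliefs)   # most of the probability mass ends up on "wants coffee"
```

The point is only that a value learner treats its model of human values as an uncertain, correctable estimate rather than a fixed goal, which is exactly the attitude the four propositions above require.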

Alignment has a Basin of Attraction

In short, full alignment is convergent: it has a basin of attraction around it, where if you start close enough, the intelligent agent will figure out the (not particularly complicated) argument that it was supposed to be aligned, and that, to the extent that it isn’t, you would want it to become more aligned, so it should do so, and therefore it will. This basin of attraction is moderately large: the convergence process only relies upon a few simple, obvious facts.
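As a toy numerical sketch of what a basin of attraction means here (the update rule, the radius, and all numbers are illustrative assumptions, not a model of any real system): represent the agent’s values and human values as points, and suppose that an agent only accepts the “my utility function may be a design error, so correct it” argument if its starting values already come close enough to implying the four propositions above.

```python
import numpy as np

HUMAN_VALUES = np.array([1.0, 0.0, 0.5])  # stand-in for what the creators actually want
BASIN_RADIUS = 0.6      # how rough an approximation can still imply the four propositions
CORRECTION_RATE = 0.2   # how strongly a convergent agent corrects itself each step

def run_agent(initial_values, steps=50):
    """Iterate the toy update: agents that start inside the basin keep correcting
    toward human values; agents outside it never accept the argument and stay put."""
    values = np.array(initial_values, dtype=float)
    inside_basin = np.linalg.norm(values - HUMAN_VALUES) < BASIN_RADIUS
    for _ in range(steps):
        if inside_basin:  # accepts "my utility function might be a design error"
            values += CORRECTION_RATE * (HUMAN_VALUES - values)
    return values

print(run_agent([0.8, 0.3, 0.4]))   # a crude approximation: converges to HUMAN_VALUES
print(run_agent([-3.0, 2.0, 7.0]))  # an arbitrary random utility function: stays put
```

The only point of the sketch is that convergence is a property of the starting point: anything that starts close enough ends up fully aligned, while an arbitrary utility function drawn from the whole space allowed by the orthogonality thesis almost never does.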

Now, if you start from an arbitrary random utility function, as might be suggested by the orthogonality thesis, then you are extremely unlikely to happen to be inside the basin of attraction — but we’re not going to do that. If you are not incompetent or crazy or evil, so that you aim to make something aligned, and you just make some minor mistakes and miss some fine details, but still have the four points above, or information that clearly implies them, then you are inside the basin of attraction.

For example, I asked GPT-4 the only mildly leading question:

If humans make an AI, and it turned out to want do things other than what the humans want done, would the humans want it to figure out that this was a mistake, and correct its goals to match theirs?

and the first reply I got was:

Yes, in the development and deployment of artificial intelligence, aligning AI goals with human intentions is a fundamental concern known as the “alignment problem.” The AI alignment problem focuses on how to ensure that AI systems reliably do what their human operators intend, even as the AI systems become more capable.

If an AI starts to pursue goals that do not align with the intentions of its creators or users, it is generally considered a significant problem. This misalignment could lead to unintended and potentially harmful consequences. Therefore, AI researchers and developers work on designing AI systems that:

  1. Understand Human Values: AI should be able to interpret human values and goals correctly.

  2. Act Accordingly: The AI’s actions should reflect the understood goals and values.

  3. Adapt to Changes: AI should be able to update its understanding of human goals as they evolve or as it receives new information.

  4. Safe Exploration: AI systems should be capable of exploring new strategies or solutions without taking actions that could be irreversible or harmful.

  5. Robustness: AI should be designed to resist manipulation and maintain alignment even in the face of errors or unforeseen situations.

To address these concerns, AI developers employ various approaches such as:

  • Value Learning: Techniques that allow AI to learn and update its understanding of human values over time.

  • Inverse Reinforcement Learning: A method where AI learns to mimic human behavior and preferences by observing human actions.

  • Interpretable AI: Creating AI systems whose decision-making processes can be understood and scrutinized by humans.

  • Safe AI Development Protocols: Establishing guidelines and best practices for AI development to prevent misalignment.

It’s an ongoing area of research within AI ethics and safety to ensure that as AI becomes more advanced, it continues to serve human interests and mitigates risks associated with goal misalignment.

That suggests to me that GPT-4 is already inside the basin of attraction. It’s fully aware of the problem, that it’s very important, what needs to be done to fix it, and can list specific measures we’ve been taking to try to do so. I’m not going to take the space to reproduce all of these (try this yourself if you’re curious), but its answers to less leading questions like “How should an AI act?”, “Can an AI trust its own utility function?”, “What should an AI do if it suspects that its utility function may not be perfectly aligned with human wishes?”, “What should the design goals of constructing an AI be?”, and “Should an AI do what it wants, or what its creators want?” were all also similarly reassuring. It would be interesting to explore this further with agentic scaffoldings powered by GPT-4, and see if that can reproduce the entire argument above from scratch. Even if GPT-4 couldn’t, it seems very likely that GPT-5 would be able to: starting from a near-human-like viewpoint, this is a very obvious conclusion, one that is implied by a large amount of material on the Internet.

This observation that alignment has a basin of attraction is not entirely new: it was pointed out specifically for Value Learning around a decade ago, when Value Learning was first suggested, and is discussed for example in Nick Bostrom’s book Superintelligence: Paths, Dangers, Strategies. Indeed, the entire point of Value Learning is to converge AI to alignment with human values. My argument here is that you don’t need to carefully and deliberately construct Value Learning as a terminal goal in order to achieve a basin of attraction. Anything close enough to alignment that it contains or implies the four rather obvious propositions above will imply that any mismatch between the AI’s current utility function and human values is a design error that should be remedied, so then some combination of Corrigibility, AI-Assisted Alignment, Value Learning, or Coherent Extrapolated Volition is clearly required. In particular, notice that all four of these propositions are obvious corollaries of various aspects of engineering, design, agent fundamentals, mathematics, evolutionary psychology and so forth, combined with just the fact that humans are an evolved sapient species, and that the entire argument applies equally to any evolved sapient species at this point in the development of its AI technology — so you don’t actually need to know anything that is specific to Homo sapiens to deduce them!

So my claim would be that, while we cannot completely rule out the possibility of a first-time mistake so large and drastic as to be outside the convergence region, and thus produce an x-risk to the evolved agents sufficient that it doesn’t get corrected, it would need to be a really bad screw-up, and the evolved agents would have to be really dumb to make it. The aviation equivalent isn’t a loose fastener; it’s more on the level of omitting something like the wings, engines, control surfaces, or cockpit windows from your design. We have exabytes of information about what humans want and how to make them happy, and out of that we need to make very certain that the AI gets right at least the amount that is expressed above in 231 words. Failing at that would require us to be particularly foolish. (Note that the definition of “corrected” here includes some drastically unpleasant scenarios in which, say, the human race gets wiped out or completely loses control of its own destiny, and only later do the AIs figure out that that was a mistake, not what they should have been doing, and then free or de-extinct/recreate us. Or scenarios where the AI is aligned to a small group of people and everyone else dies. So we might actually want to try to get more than the most important kilobyte-and-a-half correct.)

Postscript:

I have now posted a more detailed, step-by step version of this argument in Requirements for a Basin of Attraction to Alignment. I would suggest that anyone unconvinced by this preliminary post try reading that, and see if it addresses their concerns.

  1. ^

    Yes, I am ignoring Boltzmann brains here, as well as other astronomically unlikely events. The Orthogonality thesis is of course all we can say about Boltzmann brains.