Agency engineering: is AI-alignment “to human intent” enough?

TL;DR: We have not solved the “free will” problem: i.e. we do not know to what extent consciousness, culture, biology, or a myriad of other factors shape human will and intent. Yet prediction of the future behavior of the self and of other agents (i.e. what the self/​others intend) is the main driving force in the evolution of brains; AI technologies will parallel evolution by being increasingly aimed at human behavior prediction, because the most powerful tools humans can create are for predicting human behavior rather than biological or physical processes. As such, humans will become targets for “agency engineering” by powerful behavior-predicting technologies, which can interact with our cultural, political, and social structures to engineer human will from the ground up; in these scenarios it is not just AI-misuse and AI-misaligned agents but also AI-aligned agents that could weaken or harm us. Our limited progress and understanding with respect to the science of agency makes us vulnerable to this type of harm. We need solutions soon, one option being to hard-code rights and freedoms before we are irreversibly stripped of power.

Background: At the core of my research interest lies a simple question: if another agent (i.e. a person, an algorithm, etc.) can predict what you are going to do next (second, hour, week, year), do you still have agency? St Thomas Aquinas asked this question nearly 800 years ago (in Quaestiones disputatae de veritate): can free will or freedom exist in the presence of an all-knowing agent—in his case, the Christian god? His answer was essentially yes, but mostly because his agent was not interested in interfering with human action. In an age of behavior-predicting AIs, there may be many such powerful agents, and some will be interested in interfering with human actions. The simplest question we can pose, then, is: how do we “preserve human free will around powerful systems?” (thanks to Charlie Steiner for suggesting this simple summary).

The problem framing—intention is vulnerable to manipulation: By humans (not) understanding “free will” I mean that we do not understand to what extent each of the forces in our lives affects or causes our free actions. Historically, philosophers reduced the entire topic to “consciousness”: the view that our consciousness is the most important factor in deciding our actions and that it acts largely “independently” of other forces to determine action, especially on very important topics[1]. However, over the past few centuries sociology, psychology, and several other fields have uncovered many other factors that are causal in free action, including cultural forces, societal norms and goals, and political and social classes and structures.

The diagram below is an attempt at summarizing the humanities and sciences on the topic of human volitional (aka voluntary, self-controlled, free-willed, etc.) behavior. Briefly, here’s what each panel argues:

(a) the folk notion many of us hold is that our conscious will controls, or vets, most of our important volitional actions (e.g. how much toothpaste to squirt onto my toothbrush, whom to kill next, etc.).

(b) when zooming out from folk intuitions, sociology, psychology, (neuro)biology, and several other fields suggest that volitional action is generated from many interacting factors. These include external factors such as “other agents” (OAs) in the world as well as the standards, rules, and goals of our cultures (“SRG”s).

(c) current(-ish) technologies and media play an increasing role in shaping our flow of information and our choices; there is risk of abuse (e.g. “AI-misuse”) if unchecked OAs exert too much harmful influence.

(d) the “boiling frog” condition, in which human beings “feel” in control of their actions, but many (or all) of their goals are engineered or selected to support some OA’s goals, so that choosing how to achieve those pre-selected goals is rendered meaningless; such OAs control much of the flow of information (insert your favorite analogy about rich or political classes here).

(e*) (*not on the chart, because anything can happen once human beings are stripped of de facto agency).

The specific problem of aligning to “human intent”: I hope the above is not too off-putting. Some parts are probably wrong, but the overall point is hopefully not: as individual humans we (i.e. our consciousness and reasoning skills) are at best[2] only good at deciding how to achieve our goals, desires, or needs, while our language, culture, status, etc. exert much influence over which desires and actions we pursue. There is a very large literature on this. For example, the study of “attachment to group identity” is a field of its own, and many studies have shown the effect is so powerful that it can overwrite our core identities and opinions: if you tell a progressive person that their group voted for a conservative policy, they will likely also vote for the conservative policy despite it contradicting their core belief systems (and vice versa)[3]. The—hopefully not lost—point is that we (i.e. our consciousness, reasons, etc.) are not as much in charge of our free actions as we think.

If it’s not obvious: AI-misuse and AI-misaligned agents will harm human agency. It is already happening, and it will get worse (for a dystopia unfolding now, in real time, see [4]). The only remaining technical question is whether designing AI-aligned agents that follow human intent is enough to avoid harm to human agency. The diagram below argues it is not: human intent is itself dynamic and can be a target for change (or “manipulation”). Here’s the description:

(a) current attempts at developing AI/​AGIs that understand and carry out human intent and goals (blue path), and attempts at preventing entities (corporations, governments, etc.) from intentionally developing harmful AIs (red path; diagram adapted from a P. Christiano talk).

(b) the bypassing path to changing intent: both AI-aligned and AI-“safe-use” technologies can be “aligned to intent” while intentionally or unintentionally changing human intent via the broader environment, through forces acting directly on intent (the loopy arrows).

If we think of human intent or goals as “what do we, humans, want now!”, then in designing human-intent-aligned AIs we may face a circularity problem. In particular, our AI technologies will interact with our societies and cultures and necessarily change what we want or desire as a feature, not a failure mode. A rambling example: if we task an AGI with solving “cancer” or “social inequality”, then maybe along the way it will act to change our thresholds for acceptable solutions. And the solutions offered might lie near the edge of moral dilemmas: that we need to stop certain people from reproducing (or kill them, or put them into ignorable subclasses); or carry out certain types of questionable experiments; or be monitored extremely precisely to reduce the risk of cancers; or become (perhaps too quickly) merged with our technologies; or automate our lives; or that powerful religious beliefs or systems of thought are a better way to live; or that we should design better psychedelic/​escapist substances. I’m not great at science fiction, but there are many ways in which things can go wrong.

For clarity, I am claiming 3 things (thanks to a recent meeting with an AI-safety researcher for helping me clarify this):

  1. that AI-misuse can lead to intent/​agency manipulation; this is already happening (the red loopy lines) and will get exponentially worse with biometric + neural datasets; this alone is likely enough to harm humanity significantly. Claim 1 is probably not that controversial (though technical AI-alignment researchers don’t seem very interested in this class of problems).

  2. that AI-misaligned agents can target agency as well (e.g. culture, social movements, etc.; the blue loopy lines coming from a putatively misaligned AI); again, this is not that radical (others have written about it more eloquently), and this type of safety issue is somewhat related to wireheading. Claim 2 is probably more controversial than claim 1, but still acceptable.

  3. that AI agents aligned to “human intent” will also venture into agency and intention hacking as a feature, not a failure mode (the blue loopy lines coming from an AI aligned to human intent). Claim 3 is probably the most controversial.

Point 3 may look like a trivial problem. One response could be that a “truly aligned AI” would not seek to change human intent. But it is not clear how this addresses the core problem: human intent is created from (and depends on) societal structures. Because we lack an understanding of the genesis of human actions, intentions, and goals, we cannot properly specify how human intent is constructed, nor how to protect it from interference or manipulation. A world imbued with AI technologies will change the societal landscape significantly, and potentially for the worse. The folk notion is that human “intention” is a faculty that acts on the world while being somehow isolated or protected from the physical and cultural world (see Fig 1a). This model is too simplistic, because humans are embedded agents. MIRI has a great blog post about “dualistic” vs. “embedded” agents[5], and their example is orders of magnitude simpler than the real human world.

A more specific problem for AI-alignment: how would humans evaluate this situation? For example, given an aligned-AI solution, is it acceptable because (a) it is a good solution, or (b) because our moral values were changed by the aligned AI in the process of seeking a solution? This type of harm is perhaps easier to visualize as AI-misuse: if an entity decided to dump massive amounts of information, power, money, etc., driven by AI technologies, to encourage public adoption of some policy X, how would we stop them? More importantly, in a future where media and all our information streams are guided solely by AI technology, would we know what counts as manipulation, culture change, or valid argumentation? And how would we evaluate whether it is done by an aligned AI for our benefit or to harm us?

A way backwards: Perhaps a solution to this dilemma is to decide what human beings value, and hard-code it. Like really hard! If we want to protect certain rights, maybe we need to decide that they are inviolable and be willing to sanction, go to war, etc., when other countries violate them, even against their own people.[4] This may seem like a regressive step (fixing moral norms), but in some near-term future we may need to think more carefully about acting on the advice of powerful “entities” when entire moral and cultural systems are at stake.

Final rant: I’ve tried to present the problem of “agency hacking” as briefly as possible to attract more readers and, hopefully (!), constructive criticism. I hope the lack of citations isn’t off-putting (I am also writing the many-page version of this article … to be completed at some point).

My own broader view is that AI technologies will increasingly focus on behavior prediction, paralleling the way prediction has been the main driving force in the evolution of brains. This is because the most powerful tools we can build are for predicting (and controlling) human beings, not nature, biology, or physics. And once human behavior prediction is good enough, and we are embedded enough with our technologies, it may become trivial to design individualized behavior-manipulation and agency-engineering technologies. Those technologies will be great if they help us become better and happier (etc.). We need to plan better to ensure this is the world we get.

  1. ^

    Philosophical positions on the topic fall mostly into compatibilism (~60%), libertarianism, or no free will. https://philpapers.org/surveys/results.pl

  2. ^

    Many hold the view that there is no real choice and that the free will problem is in fact solved.

  3. ^

    Sorry for the academic-style link, but more can be read about this field in this academic review: https://www.annualreviews.org/doi/10.1146/annurev-polisci-042016-024408

  4. ^

    AI technologies trained on data obtained from oppressed populations are already being deployed in at least one place.

    “The process Byler describes involves forced harvesting of people’s data and then using it to improve predictive artificial intelligence technology. It also involves using the same population as test subjects for companies developing new tech. In other words, Xinjiang serves as an incubator for new tech.”

    https://www.cbc.ca/radio/ideas/china-s-high-tech-repression-of-uyghurs-is-more-sinister-and-lucrative-than-it-seems-anthropologist-says-1.6352535

  5. ^