[AN #131]: Formalizing the argument of ignored attributes in a utility function

Link post

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Consequences of Misaligned AI (Simon Zhuang et al) (summarized by Flo): One intuition for why powerful AI systems might lead to bad consequences goes as follows:

1) Humans care about many attributes of the world and we would likely forget some of these when trying to list them all.

2) Improvements along these attributes usually require resources, and gaining additional resources often requires sacrifices along some attributes.

3) Because of 1), naively deployed AI systems would only optimize some of the attributes we care about, and because of 2) this would lead to bad outcomes along the other attributes.

This paper formalizes this intuition in a model, identifies conditions under which deploying AI systems can reduce true utility within the model, and proposes two mitigation strategies: impact minimization and interactivity.

We assume that the world state consists of L attributes, all of which the human cares about having more of; that is, true utility is strictly increasing in each of the attributes. Each attribute has some minimum value, and can be increased from that minimum value through the use of a fixed, finite resource (which you could think of as money, if you want); this allows us to formalize (2) above. To formalize (1), we assume that the proxy utility optimized by the AI is only allowed to depend on J < L of the attribute dimensions.
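
As a rough sketch of this setup (the notation here is my own reconstruction, not necessarily the paper's): the state is a vector of attributes, each bounded below, subject to a shared resource budget; true utility increases in every attribute, while the proxy is restricted to J of them.

$$s = (s_1, \dots, s_L), \qquad s_i \ge c_i \text{ for all } i, \qquad \text{total resources spent} \le B$$

$$U_{\text{true}}(s) \text{ strictly increasing in each } s_i, \qquad U_{\text{proxy}}(s) = g(s_{i_1}, \dots, s_{i_J}), \quad J < L$$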

Given this setup, the paper proves that if the AI maximizes the proxy utility, then all attributes that were omitted in the proxy utility will be set to their minimal value. This will be worse than not using the AI system at all if 1) the minimum values of attributes are sufficiently small (allowing the AI to cause damage), 2) the resource cost (resp. gain in true utility) for increasing an attribute is independent of the other attributes’ level, 3) it always costs at least K resources to get a unit increase in any attribute, for some K > 0, and 4) utility has diminishing marginal returns in each attribute (and marginal returns tend to zero as the attribute increases).
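
Here is a small numerical illustration of that result (the specific utility function, budget, and minimum values are my own toy choices, not taken from the paper):

```python
import numpy as np

# Toy instance of the model above: true utility is a sum of
# diminishing-returns terms, each unit of an attribute costs one unit of
# a shared resource, and attributes can be pushed down to a very low minimum.
L, J = 4, 2                      # 4 attributes matter; the proxy sees only 2
budget = 10.0                    # extra resources the AI can spend
baseline = np.ones(L)            # attribute levels if the AI is never turned on
minimum = np.full(L, -0.99)      # "sufficiently small" minimum values

def true_utility(x):
    return np.sum(np.log1p(x))      # strictly increasing, diminishing returns

def proxy_utility(x):
    return np.sum(np.log1p(x[:J]))  # only the first J attributes appear

# The proxy maximizer drives the omitted attributes to their minimum,
# reclaims the resources that were supporting them, and spends everything
# on the J attributes the proxy rewards (split evenly, by symmetry).
reclaimed = np.sum(baseline[J:] - minimum[J:])
optimized = minimum.copy()
optimized[:J] = baseline[:J] + (budget + reclaimed) / J

print("proxy utility:", proxy_utility(baseline), "->", proxy_utility(optimized))
print("true utility: ", true_utility(baseline), "->", true_utility(optimized))
# Proxy utility goes up, but true utility ends up lower than with no AI at all.
```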

Regarding mitigation, impact minimization requires that the AI keep constant all attributes that are omitted from the proxy. In this case, any gains in proxy utility must also be gains in true utility.

Meanwhile, in the interactive condition, the human regularly selects a new proxy (still only specifying J < L weights), or can choose to turn the AI off. Whether or not this is helpful depends on the AI’s optimization strategy and the frequency of human interventions: if the AI is “efficient”, in the sense that it changes attributes as little as possible for any fixed gain in proxy utility, the human can choose a proxy that guarantees that, locally, increases in the proxy correspond to increases in true utility. The strategy is to choose the attributes that are most sensitive to changes in resources (i.e. those with the largest marginal returns) at the current state, and to define the proxy to grow in these attributes as quickly as the true utility does. As long as the human provides new proxies frequently enough to prevent the local guarantee from breaking, optimizing the proxy increases human utility.

We can also combine interactivity and impact minimization: in this case, the human should choose proxy utility functions that contain the most and least sensitive attributes (i.e. largest and smallest marginal returns) for the given state. The AI will then transfer some resources from the least sensitive attributes to the most sensitive attributes, while holding all other attributes fixed, leading to a guaranteed increase in true utility. In fact, it is possible to prove that this will converge to the maximum possible true utility.
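
A minimal sketch of what this combined protocol looks like, again on a toy utility of my own choosing (the starting levels, step size, and stopping tolerance are illustrative assumptions, not from the paper):

```python
import numpy as np

# Combined interactive + impact-minimization protocol on a toy utility
# (sum of log1p terms, unit resource costs).
x = np.array([0.5, 2.0, 5.0, 9.0])    # current attribute levels
step = 0.01                            # resources the AI may move per round

def marginal_return(x):
    return 1.0 / (1.0 + x)             # d/dx log1p(x): "sensitivity" of each attribute

def true_utility(x):
    return np.sum(np.log1p(x))

for _ in range(5000):
    mr = marginal_return(x)
    most, least = np.argmax(mr), np.argmin(mr)
    if mr[most] - mr[least] < 1e-3:    # marginal returns (nearly) equalized
        break
    # The human's proxy names only the most and least sensitive attributes;
    # impact minimization pins every other attribute in place, so the AI's
    # only move is to shift a little resource from `least` to `most`,
    # which increases true utility at every step.
    x[least] -= step
    x[most] += step

print("final levels:", np.round(x, 2))            # roughly equalized
print("true utility:", round(true_utility(x), 3))
```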

Flo’s opinion: This is close to an informal model I’ve had for a while, and I am glad that it has been formalized, complete with theoretical results. I find it interesting that the frequency of updates to the proxy matters even if movement in the state space is reversible. As the authors mention, it is also crucial that the AI’s actions don’t hinder the human’s ability to update the proxy, and I imagine that frequent proxy updates would often be important for that as well.

Rohin’s opinion: This is a nice formalization of several important conceptual points in the AI alignment literature:

1. If you forget to specify something you care about, it will usually be set to extreme values (Of Myths and Moonshine). In particular, the AI system will extract any resources that were being used for that attribute, and apply them elsewhere (The Basic AI Drives (AN #107), Formalizing convergent instrumental goals).

2. Given that perfect information is impossible, interactivity becomes important (Human-AI Interaction (AN #41), Incomplete Contracting and AI Alignment (AN #3)).

3. Conservatism (in this case through impact regularization) can be helpful (see the many blog posts and papers on mild optimization, low impact, and conservatism).

TECHNICAL AI ALIGNMENT


HANDLING GROUPS OF AGENTS

Social choice ethics in artificial intelligence (Seth D Baum) (summarized by Rohin): If we want to program ethics into an AI system, should we do so by aggregating the ethical views of existing humans? This is often justified on procedural grounds: “everyone gets to affect the outcome”, or by abstention: “AI designers don’t have to think about ethics; the AI will deal with that”. (There is also a wisdom-of-the-crowds justification, though this presupposes that there is some notion of “better” ethics independent of humans, which is out of scope for the paper.)

However, actually implementing an aggregative procedure requires three major design decisions: 1) standing, that is, whose views should be aggregated, 2) measurement, that is, how we determine what their ethical views are, and 3) aggregation, that is, how the views are put together into a whole. All of these are challenging.

For standing, we have to determine whom to include. Should we include children, psychopaths, non-human animals, ecosystems, future generations, and other AI systems? We must determine this ahead of time, since once we have decided on a social choice system, that system will then determine whose preferences are counted—we can’t just modify it later.

For measurement, we have to back out human values somehow, which is quite a challenge given that humans have all sorts of cognitive biases and give different answers depending on the context. (See also ambitious value learning (AN #31) and subsequent posts in the sequence.)

For aggregation, the problems are well known and studied in the field of social choice theory. Some famous impossibility results include Arrow’s theorem and the Gibbard-Satterthwaite theorem.
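
As a concrete instance of why aggregation is hard, here is the classic Condorcet cycle (a standard social-choice example, not one taken from the paper): three individually consistent rankings whose pairwise majority vote produces no coherent group ranking.

```python
from itertools import combinations

# Three voters, three options, and individually transitive rankings.
rankings = [
    ["A", "B", "C"],   # voter 1: A > B > C
    ["B", "C", "A"],   # voter 2: B > C > A
    ["C", "A", "B"],   # voter 3: C > A > B
]

def prefers(ranking, x, y):
    return ranking.index(x) < ranking.index(y)

for x, y in combinations("ABC", 2):
    votes_for_x = sum(prefers(r, x, y) for r in rankings)
    winner = x if votes_for_x > len(rankings) / 2 else y
    print(f"{x} vs {y}: majority prefers {winner}")
# A beats B, B beats C, yet C beats A: the "aggregated view" is cyclic.
```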

Rohin’s opinion: I see this paper as a well-organized literature review of the many reasons why you don’t want to handle AI alignment by finding the “true human utility function” or the “aggregated preferences of humanity” and then encoding them into the AI: there’s a myriad of challenges in even finding such an object. (A separate objection, out of scope for this paper, is that even if we did have such an object, we don’t know how to encode that goal into an AI system.)

You might then reasonably ask what we should be doing instead. I see the goal of AI alignment as figuring out how, given a fuzzy but relatively well-specified task, to build an AI system that is reliably pursuing that task, in the way that we intended it to, but at a capability level beyond that of humans. This does not give you the ability to leave the future in the AI’s hands, but it would defuse the central (to me) argument for AI risk: that an AI system might be adversarially optimizing against you. (Though to be clear, there are still other risks (AN #50) to consider.)

MISCELLANEOUS (ALIGNMENT)

Non-Obstruction: A Simple Concept Motivating Corrigibility (Alex Turner) (summarized by Rohin): The Reframing Impact sequence (AN #68) suggests that it is useful to think about how well we could pursue a range of possible goals; this is called the attainable utility (AU) landscape. We might think of a superintelligent AI maximizing a utility function U as causing this landscape to become “spiky”: the value for U will go up, but the value for all other goals will go down. If we get this sort of spikiness for an incorrect U, then the true objective will have a very low value.

Thus, a natural objective for AI alignment research is to reduce spikiness. Specifically, we can aim for non-obstruction: turning the AI on does not decrease the attainable utility for any goal in our range of possible goals. Mild optimization (such as quantilization (AN #48)) reduces spikiness by reducing the amount of optimization that an AI performs. Impact regularization aims to find an objective that, when maximized, does not lead to too much spikiness.
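
Written out (this is my paraphrase of the sentence above, not necessarily the notation of the original post), non-obstruction says that for every goal P in the set S of goals we might plausibly have,

$$\mathrm{AU}(P \mid \text{AI turned on}) \;\ge\; \mathrm{AU}(P \mid \text{AI never turned on}).$$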

One particular strategy for non-obstruction would be to build an AI system that does not manipulate us, and allows us to correct it (i.e. modify its policy). Then, no matter what our goal is, if the AI system starts to do things we don’t like, we would be able to correct it. As a result, such an AI system would be highly non-obstructive. This property where we can correct the AI system is corrigibility. Thus, corrigibility can be thought of as a particular strategy for achieving non-obstruction.

It should be noted that all of the discussion so far is based on actual outcomes in the world, rather than what the agent was trying to do. That is, all of the concepts so far are based on impact rather than intent.

Rohin’s opinion: Note that the explanation of corrigibility given here is in accord with the usage in this MIRI paper, but not with the usage in the iterated amplification sequence (AN #35), where it refers to a broader concept. The broader concept might roughly be defined as “an AI is corrigible when it leaves its user ‘in control’”; see the linked post for examples of what ‘in control’ involves. (Here too, you can have both an impact-based and an intent-based version of the definition.)

On the model that AI risk is caused by utility maximizers pursuing the wrong reward function, I agree that non-obstruction is a useful goal to aim for, and the resulting approaches (mild optimization, low impact, corrigibility as defined here) make sense to pursue. I do not like this model much (AN #44), but that’s (probably?) a minority view.

Mapping the Conceptual Territory in AI Existential Safety and Alignment (Jack Koch) (summarized by Rohin): There are a bunch of high-level overviews and research agendas, not all of which agree with each other. This post attempts to connect and integrate several of these, drawing heavily on Paul Christiano’s overview (AN #95), my overview, and the ARCHES agenda (AN #103), but also including a lot of other work. It serves as a good way of connecting these various perspectives; I recommend reading it for this reason. (Unfortunately, it is rather hard to summarize, so I haven’t done so.)

AI safety: state of the field through quantitative lens (Mislav Juric et al) (summarized by Rohin): This paper presents data demonstrating growth in various subfields related to AI safety. The data was collected by querying databases of papers and (presumably) reporting the number of results each query returned.

Rohin’s opinion: The sharpest increases in the graphs seem to be in interpretability and explainable AI around 2017-18, as well as in value alignment starting in 2017. My guess is that the former is the result of DARPA’s interest in the area (which I believe started in 2016), and the latter is probably a combination of the founding of the Center for Human-Compatible AI (CHAI) and the publication and promotion of CIRL (AN #69) (one of CHAI’s early papers).

Surprisingly to me, we don’t see trend deviations in papers on “reward hacking”, “safe exploration”, or “distributional shift” after the publication of Concrete Problems in AI Safety, even though it has been cited way more often than CIRL, and seemed like it had far more of an effect on mainstream AI researchers. (Note that “safe exploration” did increase, but it seems in line with the existing trend.)

Note that I expect the data source is not that reliable, and so I am not confident in any of these conclusions.

AI GOVERNANCE

Society-in-the-loop: programming the algorithmic social contract (Iyad Rahwan) (summarized by Rohin): Earlier in this newsletter we saw arguments that we should not build AI systems that are maximizing “humanity’s aggregated preferences”. Then how else are we supposed to build AI systems that work well for society as a whole, rather than an individual human? When the goal of the system is uncontested (e.g. “don’t crash”), we can use human-in-the-loop (HITL) algorithms where the human provides oversight; this paper proposes that for contested goals (e.g. “be fair”) we should put society in the loop (SITL), through algorithmic social contracts.

What is a social contract? A group of stakeholders with competing interests have a (non-algorithmic) social contract when they “agree” to allow use of force or social pressure to enforce some norm that guards people’s rights and punishes violators. For example, we have a social contract against murder, which legitimates the use of force by the government in order to punish violators.

In an algorithmic social contract, the norms by which the AI system operates, and the goals which it pursues, are determined through typical social contracts amongst the group of stakeholders that care about the AI system’s impacts. Notably, these goals and norms can change over time, as the stakeholders see what the AI system does. Of course, this all happens on relatively long timescales; more immediate oversight and control of the AI system would have to be done by specific humans who are acting as delegates of the group of stakeholders.

The paper then goes into many open challenges for creating such algorithmic social contracts: How does society figure out what goals the AI system should pursue? How do we deal with externalities and tradeoffs? How can these fuzzy values be translated into constraints on the AI system? It provides an overview of some approaches to these problems.

Rohin’s opinion: I really like the notion of an algorithmic social contract: it much better captures my expectation of how AI systems will be integrated into society. With this vocabulary, I would put technical AI alignment research squarely in the last category, of how we translate fuzzy values that society agrees on into constraints on the AI system’s behavior.

Fragmentation and the Future: Investigating Architectures for International AI Governance (Peter Cihon et al) (summarized by Rohin): Should AI governance be done centrally, through an international body, or in a fragmented, decentralized fashion? This paper identifies various considerations pointing in different directions:

1. Centralized institutions can have more political power when designed well: their regulations can have more “teeth”.

2. Centralized institutions can be more efficient from the participants’ perspective: if there is only one set of regulations, it is much easier for each participant to adhere to those regulations.

3. A centralized institution will typically be slower to act, as there are many more parties with a larger stake in the outcome. This can make it brittle, especially when the pace of technological change outpaces that of regulatory change.

4. Centralized institutions face a breadth vs. depth dilemma: if the regulations are too stringent, then some actors (i.e. nations, companies, etc) won’t participate (there is depth but not breadth), and similarly, to get everyone to participate the regulations must often be quite weak (breadth but not depth). In contrast, with decentralized approaches, the depth of the regulations can be customized to each participant.

5. With more fragmented approaches, actors can “forum shop” for the regulations which they think are best. It is unclear whether this is helpful or harmful for AI governance.

6. It is unclear which approach leads to more coordination. While a centralized approach ensures that everyone has the same policies, leading to policy coherence, it does not necessarily mean that those policies are good. A decentralized approach could lead to faster adaptation leading to better policies that are then copied by others, leading to more effective coordination overall.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.