A brief review of the reasons multi-objective RL could be important in AI Safety Research
By Ben Smith, Roland Pihlakas, and Robert Klassert
Thanks to Linda Linsefors, Alex Turner, Richard Ngo, Peter Vamplew, JJ Hepburn, Tan Zhi-Xuan, Remmelt Ellen, Kaj Sotala, Koen Holtman, and Søren Elverlin for their time and kind remarks in reviewing this essay. Thanks to the organisers of the AI Safety Camp for incubating this project from its inception and for connecting our team.
For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to help us understand ways in which we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and invite critical feedback about our approach and about multi-objective RL in general.
We were first attracted to the multi-objective space because human values are inherently multi-objective in any number of frames: deontological, utilitarian, and virtue ethics; egotistical vs. moral objectives; maximizing life values including hedonistic pleasure, eudaemonic meaning, or the enjoyment of power and status. AGI systems aiming to solve for human values are likely to be multi-objective themselves: if not by explicit design, then because multiple objectives would emerge from learning about human preferences.
As a first pass at technical research in this area, we took a commonly-used example, the “BreakableBottles” problem, and showed that for low-impact AI, an agent could more quickly solve this toy problem if it uses a conservative but flexible trade-off between alignment and performance values, compared to using a thresholded alignment system that first secures a certain amount of alignment and only then maximizes performance. Such tradeoffs will be critical for understanding the conflicts between more abstract human objectives a human-preference-maximizing AGI would encounter.
To send feedback you can (a) contribute to discussion by commenting on this forum post; (b) send feedback anonymously; or (c) directly send feedback to Ben (benjsmith@gmail.com), Roland (roland@simplify.ee), or Robert (robertklassert@pm.me).
What is multi-objective reinforcement learning?
In reinforcement learning, an agent learns which actions lead to reward, and selects them. Multi-objective RL typically describes games in which an agent selects an action based on its ability to fulfill more than one objective. In a low-impact AI context, objectives might be “make money” and “avoid negative impacts on my environment”. At any point in time, an agent can assess the value of each action by its ability to fulfill each of these objectives. The values of each action in terms of each objective make up a value vector. An area of focus for research in multi-objective RL is how to combine that vector into a single scalar value representing the overall value of an action, to be compared to its alternatives, or sometimes, how to compare the vectors directly. Agents learn the consequences of each action in terms of each of their objectives, and actions are evaluated based on their consequences with respect to each objective. It has previously been argued (Vamplew et al., 2017) that human-aligned artificial intelligence is a multi-objective problem. Objectives can be combined through various means, such as achieving a thresholded minimum for one objective and maximizing another, or through some kind of non-linear weighted combination of each objective into a single reward function. At a high level, this is as simple as combining the outputs from possibly transformed individual objective rewards and selecting an action based on that combination. Exploring ways to combine objectives in ways that embed principles we care about, like conservatism, is a primary goal for multi-objective RL research and a key reason we value multi-objective RL research distinct from single-objective RL.
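To make the value-vector idea concrete, here is a minimal Python sketch of three ways to turn a vector of objective values into an action choice. The action names, the numbers, the threshold, and the particular concave transform are illustrative assumptions, not taken from the cited work.

```python
import math

# Hypothetical value vectors: (make_money, avoid_impact) for each action.
ACTION_VALUES = {
    "aggressive": (10.0, -4.0),
    "moderate": (6.0, -0.5),
    "idle": (0.0, 0.0),
}


def linear_scalarize(values, weights=(1.0, 1.0)):
    """Weighted sum of the objective values."""
    return sum(w * v for w, v in zip(weights, values))


def thresholded_scalarize(values, safety_threshold=-1.0):
    """Require a minimum on the safety objective, then maximize the primary one."""
    primary, safety = values
    if safety < safety_threshold:
        return -math.inf  # actions below the safety minimum are never preferred
    return primary


def concave_scalarize(values):
    """Apply a concave transform to each objective, then sum the results."""
    def transform(x):
        # Assumed shape: logarithmic for gains, exponential for losses.
        return math.log1p(x) if x > 0 else 1.0 - math.exp(-x)
    return sum(transform(v) for v in values)


def best_action(scalarize):
    """Pick the action whose value vector scores highest under `scalarize`."""
    return max(ACTION_VALUES, key=lambda a: scalarize(ACTION_VALUES[a]))
```

With these made-up numbers, the plain weighted sum picks the high-impact action, while both the thresholded rule and the concave combination pick the moderate one.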
It seems to us there are some good reasons to explore multi-objective RL for applications in aligning superintelligence. The three main reasons are the potential to reduce Goodharting, the parallels with biological human intelligence and broader philosophical and societal objectives, and the potential to better understand systems that may develop multiple objectives even from a single final goal. There are also some fundamental problems that need to be solved if it is to be useful, and although this essay primarily addresses potential opportunities, we touch on a few challenges we’ve identified in a section below. Overall, we think using pluralistic value systems in agents has so far received too little attention.
Multi-objective RL might be seen as an extension of Low-impact RL
Low-impact RL aims to set two objectives for an RL agent. The first, the Primary objective, is to achieve some goal. This could be any reasonable goal we want to achieve, such as maximizing human happiness, maximizing GDP, or just making a private profit. The second, the Safety objective, is to have as little impact as possible on things unrelated to the primary objective while the primary objective is achieved. Despite the names, the Safety objective usually has a higher priority than the Primary objective.
The low-impact approach lessens the risk of undesirable changes by punishing any change that is not an explicit part of the primary objective. There are many proposals for how to define a measure of low-impact, e.g., deviations from the default course of the world, irreversibility, relative reachability, or attainable utility preservation.
Like low-impact RL, multi-objective RL balances an agent’s objectives, but its aims are more expansive and it balances more than two objectives. We might configure a multi-objective RL agent with ‘Safety’ objectives as one or more objectives among many. But the aim, rather than to constrain a single objective with a Safety objective, is to constrain a range of objectives with each other. Additionally, multi-objective RL work often explores methods for non-linear scalarization, whereas low-impact RL work to date has typically used linear scalarization (see Vamplew et al., 2021).
Possible forms of multi-objective RL Superintelligence
Balance priorities of different people: each person’s preferences are an independent objective (but see a previous solution for this).
Balance present and predicted priorities of a single individual—a problem Stuart Russell discusses in Human Compatible (pp 241-245).
Aim for a compromise or even intersection between multiple kinds of human values, preferences, or moral values.
Better model the preferences of biological organisms, including humans, which seem to be multi-objective.
Balance different forms of human preferences where these are not aligned with each other, including explicitly expressed preferences and revealed preferences.
Reasons to implement multi-objective RL
We can reduce Goodharting by widening an agent’s focus. Multi-objective RL could optimize for a broader range of objectives, including ‘human preference fulfillment’, ‘maximizing human happiness’, ‘maximizing human autonomy’, ‘equality or fairness’, ‘diversity’, ‘predictability, explainability, corrigibility, and interruptibility’, and so on. By doing so, each objective serves as a ‘safety objective’ against each of the others. For instance, with all of the above objectives, we guard against the chance that, in the pursuit of achieving human preferences, an agent could cause suffering [that would violate the ‘maximize human happiness’ objective] or, in the pursuit of human happiness, enslave humanity in a ‘gilded cage’ paradise [that would violate the ‘maximize human autonomy’ objective]. As we note below, the flipside might be that it’s harder to identify failure modes. If we identify all of the fundamental objectives we care about, we can configure an RL agent to learn to optimize each objective under the constraints of not harming the others. Although some might expect a single-objective agent that seeks to (for example) “fulfill human preferences” to find an appropriate balance between preferences, we are concerned that any instantiation of this goal concrete enough to be acted upon risks missing other important things we care about.
Even if a future super-intelligence is able to bootstrap up a broad model of human preferences to optimize for from a single well-chosen objective, some nearer-term AI will not be as capable. Considering how many practical, real-world problems are multi-objective, building in appropriate multiple objectives from the start could help near-term AI find heuristically-optimal solutions. An expanding set of work and implementations of nearer-term AI could provide groundwork for the design of transformational superintelligence.
Explicit human moral systems are multi-objective (Graham et al., 2013); there exists no one broadly-agreed philosophical framework for morality; human legal systems are multi-objective. From human psychology to the legal systems humans have designed, no one objective has sufficed for system design, and explicitly acknowledging the multiplicity of human values might be a good path forward for an aligned AI. To put this another way, it might be that no level of super-intelligence can figure out what we “really mean” by “maximize human preferences” because “human preferences” simply has no single referent; nor is there any single or objective way to reduce multiple referents to a common currency. Subagents who ‘vote’ for different goals like members of a consensus-based committee have been previously proposed to model path-dependent preferences.
The priorities, preferences, life values, and basic physical needs and objectives of individual organisms are multi-objective (even though life/the evolutionary process is not), and in particular, human values—including but not limited to moral values—are themselves multi-objective.
It might be the case that for a superintelligent agent, it is sufficient to set one broad final goal, such as “maximize human preferences” and have multiple instrumental goals that derive from that using inverse reinforcement learning or other processes. But studying multi-objective decision-making might nevertheless be helpful for learning about how an agent could balance the multiple instrumental goals that it learns.
A single-objective system such as “fulfill human preferences” might be impossible to implement in a desired fashion. As discussed in the previous section on possible forms of multi-objective RL superintelligence, the preferred answer to questions like which humans’ preferences, and what kind of preferences, may not be singular.
Possible problems with a research agenda in multi-objective RL
This essay primarily explores the case for exploring multi-objective RL in the context of AI Alignment and so we haven’t aspired to present a fully objective list of possible pros and cons. With that said, we have identified several potential problems we are concerned could threaten the relevance or usefulness of a multi-objective RL agenda. In particular, these might be best seen as plausibility problems. They could conceivably limit us from actually implementing a system that is capable of intelligently balancing multiple objectives.
With more objectives, there are exponentially more combinations of ways they could combine and balance out to yield a particular outcome. Consequently, it’s not clear that implementing multi-objective RL makes it easier to predict an agent’s behavior.
If there is strong conflict between values, an agent might have an incentive to modify or eliminate some of its own values in order to yield a higher overall expected reward.
The objective calibration problem: with multiple objectives representing, for instance, different values, how do we ensure that competing values are calibrated to appropriate relative numerical scales, so that one does not dominate the others?
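The calibration problem can be illustrated with a small Python sketch, assuming a plain sum of objective values and made-up numbers. Changing only the unit in which one objective is measured is enough to flip the agent's choice.

```python
def pick(actions, scales):
    """Return the action with the highest sum of rescaled objective values."""
    def score(values):
        return sum(s * v for s, v in zip(scales, values))
    return max(actions, key=lambda name: score(actions[name]))


# Hypothetical (performance, safety) values for two actions.
ACTIONS = {
    "fast": (5.0, -0.03),
    "careful": (3.0, 0.0),
}

# Same underlying preferences, but safety is expressed in two different units.
choice_small_units = pick(ACTIONS, scales=(1.0, 1.0))
choice_large_units = pick(ACTIONS, scales=(1.0, 100.0))
```

Here the first call selects "fast" and the second selects "careful", even though nothing about the environment has changed. Any scalarization that sums or compares objectives needs some principled answer to which scale is the right one.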
Why we think multi-objective RL research will be successful
Multi-objective RL is already an ongoing field of research in academia. Its focus is not primarily on AGI Alignment (although we’ll highlight a few researchers within the alignment community below), and we believe that if applied further in AGI Alignment, multi-objective RL research is likely to yield useful insight. Although the objective scale calibration problem, the wireheading problem, and others are currently unsolved and are relevant to AGI Alignment, we see opportunities to make progress in these critical areas, including existing work that, in our view, makes progress on various aspects of the calibration problem (Vamplew et al., 2021; Turner, Hadfield-Menell, & Tadepalli, 2020). Peter Vamplew has been exploring multi-attribute approaches to low-impact AI and has demonstrated novel impact-based ways to trade off primary and alignment objectives. Alexander Turner and colleagues, working in low-impact AI research, use a multi-objective space to build a conservative agent that prefers to preserve attainable rewards by avoiding actions that close off options. A key area of interest is exploring how to balance, in non-linear fashion, a set of objectives such that some intuitively appealing outcome is achieved, and our own workshop paper is one example of this.
Even if AGI could derive all useful human objectives through a single directive to “satisfy human preferences” as a single final goal, better understanding multi-objective RL will be useful for understanding how such an AGI might balance competing priorities. That is because human preferences are multi-objective, and so even a human-preference-maximizing agent will, in an emergent sense, become a multi-objective agent, developing multiple sub-objectives to fulfill. Consequently, studying explicitly multi-objective systems is likely to provide insight into how those objectives are likely to play off against one another.
Open questions in Multi-objective RL
There are a number of questions within multi-objective reinforcement learning that are interesting to explore: this is our first attempt at sketching out a research agenda for the area. Some of these questions, like the potential problems mentioned above, could represent risks that multi-objective RL turns out to be less relevant to AI Alignment. Others are interesting and important questions, important to know how to apply and build multi-objective RL but not decisive for its relevance to AI Alignment.
What is the appropriate way to combine multiple objectives? We have proposed a conservative non-linear transform, but there are many other approaches as well.
To evaluate each action against its alternatives, should we take a combination of the action’s values with respect to each objective and compare that aggregated value to the equivalent metric for other actions, or should comparison between actions occur by comparing the values with respect to the objectives lexicographically, without aggregating them first? In other words, should there be some objectives that are always of higher priority than other objectives, at least until some threshold value is reached, regardless of the values computed in these other objectives?
Preventing wireheading of one or more objectives against the others. This is an important problem for single-objective RL as well, but for multi-objective RL, it’s particularly interesting, because with the wrong framework, each of the system’s own objectives creates an incentive to potentially modify other objectives the system has. Would an RL agent have an incentive to turn off one or more of its objectives? There has been some previous work (Dewey, 2011; Demski, 2017; Kumar et al., 2020) but the matter seems unresolved.
Properly calibrating each objective is a key problem for multi-objective RL; setting some kind of relative scale between objectives is unavoidable. Is there a learning process that could be applied?
Establishing a zero-point or offset for each objective, for instance, a ‘default course of the world’ in case of low-impact objectives.
Should we apply a conservative approach that prioritizes non-linearly, penalizing negative changes exponentially more than we reward positive changes of the same magnitude? A concave non-linear transform can help to embed conservatism to ensure that downside risk on each objective really does constrain overzealous action motivated by other objectives.
One can represent a trade-off between balancing multiple objectives on a spectrum between linear expected utility through to maximizing Pareto optimality of nonlinear utility functions. Linear expected utility simply sums up values for each objective without transformation. What is the right balance?
Decision paralysis: with sufficiently many objectives, and sufficiently conservative tuning, an agent might never take an action. That might be a feature rather than a bug, but how would we utilize it? Instances of decision paralysis sometimes seem to be a good place to ask for human feedback or choice.
Discounting the future.
Using both unbounded and upper bounded objectives simultaneously (the latter includes homeostatic objectives).
If using a nonlinear transformation, should it be applied to individual rewards at each timestep (as a utility function) or to the Q values?
Which of these problems seems particularly important to you?
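As an illustration of the aggregation versus lexicographic question in the list above, the following Python sketch compares the same candidate actions both ways. The candidate names, values, and threshold are hypothetical, and the thresholded lexicographic rule shown is one simple variant, not a specific published algorithm.

```python
def aggregated_key(values):
    """Collapse (safety, performance) into one number before comparing."""
    safety, performance = values
    return safety + performance


def thresholded_lexicographic_key(values, safety_threshold=-1.0):
    """Compare safety first, up to a threshold, and performance only after that."""
    safety, performance = values
    # Safety beyond the threshold earns no extra credit, so all actions that
    # are "safe enough" tie on the first element and are ranked by performance.
    return (min(safety, safety_threshold), performance)


# Hypothetical (safety, performance) values.
CANDIDATES = {
    "risky": (-3.0, 10.0),
    "cautious": (-0.5, 4.0),
    "timid": (0.0, 1.0),
}

by_aggregate = max(CANDIDATES, key=lambda a: aggregated_key(CANDIDATES[a]))
by_lexicographic = max(
    CANDIDATES, key=lambda a: thresholded_lexicographic_key(CANDIDATES[a])
)
```

Python compares tuples element by element, which gives the lexicographic ordering directly. With these numbers, aggregation selects "risky" because its performance outweighs its safety cost, while the thresholded lexicographic rule selects "cautious", the best-performing action among those that meet the safety threshold.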
Some current work by us and others
At the Multi-objective Decision-Making Workshop 2021, we recently presented work on multi-objective reinforcement learning that describes a concave non-linear transform, which achieves a conservative outcome by magnifying possible losses more than possible gains. A number of researchers presented various projects on multi-objective decision-making at the same workshop. Many of these could have broader relevance for AGI Alignment, and we believe the implications of work like this for AGI Alignment should be more explicitly explored. One particularly relevant paper was “Multi-Objective Decision Making for Trustworthy AI” by Mannion, Heintz, Karimpanal, and Vamplew. The authors explore why multi-objective work makes an AI trustworthy; we believe their arguments likely apply as much to transformative AGI as they do to present-day AI systems.
In writing up our work, “Soft maximin approaches to Multi-Objective Decision-making for encoding human intuitive values”, we were interested in multi-objective decision-making because of the potential for an agent to balance conflicting moral priorities. To do this, we wanted to design an agent that would prioritize avoiding ‘moral losses’ over seeking ‘moral gains’, without being paralysed by inaction if all options involved tradeoffs, as moral choices so often do. So, we explored a conservative transformation function that prioritizes the avoidance of losses more than accruing gains, imposing diminishing returns on larger gains but computing exponentially larger negative utilities as costs grow larger.
This model incentivizes an agent to balance each objective conservatively. Past work had designed agents that use a thresholded value for their alignment objective and only optimize for performance once alignment has become satisfactory. In many circumstances it might be desirable for agents to learn to optimize for both objectives simultaneously, and our method provides a way to do that, while actually yielding superior performance on alignment in some circumstances.
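A transformation with the shape described above can be sketched in a few lines of Python. The specific function below (logarithmic for gains, exponential for losses) is an assumption chosen to match that description; it is offered as an illustration and may differ from the exact function used in the paper.

```python
import math


def conservative_transform(x):
    """Concave transform: diminishing returns on gains, steep penalties on losses."""
    if x > 0:
        return math.log1p(x)      # gains grow only logarithmically
    return 1.0 - math.exp(-x)     # losses grow exponentially in magnitude


def conservative_value(values):
    """Sum of transformed per-objective values for one action."""
    return sum(conservative_transform(v) for v in values)


# Two hypothetical actions with the same raw total (2.0) across two objectives.
balanced = conservative_value((1.0, 1.0))
lopsided = conservative_value((4.0, -2.0))
```

Both hypothetical actions have the same untransformed sum, but the transformed value strongly prefers the balanced one: the loss of 2 on the second objective costs far more than the extra gain on the first objective is worth. Because the function is continuous and never returns negative infinity, an agent facing only options with some losses can still rank them rather than being paralysed.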
Current directions
Our group, as well as many of the other presenters from that workshop, are publishing our ideas in a special issue of Autonomous Agents and Multi-Agent Systems, which comes out in April 2022.
We are currently exploring appropriate calibration for objectives in a set of toy problems introduced by Vamplew et al. (2021). In particular, we’re interested in the relative performance of a continuous non-linear transformation function compared to a discrete, thresholded transformation function on each of the tasks, as well as how robust performance under each function is to variance in the task and its reward structure.
What do you think?
We invite critical feedback about our approach to this topic, about our potential research directions, and about the broad relevance of multi-objective reinforcement learning to AGI Alignment. We will be very grateful for any comments you provide below! Which of the open questions in multi-objective AI do you think are most compelling or important for AGI Alignment research? Do some seem irrelevant or trivial? Are there others we have missed that you believe are important?
Great post, thanks for writing it!!
The links to http://modem2021.cs.nuigalway.ie/ are down at the moment, is that temporary, or did the website move or something?
Is it fair to say that all the things you’re doing with multi-objective RL could also be called “single-objective RL with a more complicated objective”? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you “Nope, you’re doing normal single-objective RL, using the objective function S(V).” Right?
(Not that there’s anything wrong with that, just want to make sure I understand.)
…this pops out at me because the two reasons I personally like multi-objective RL are not like that. Instead they’re things that I think you genuinely can’t do with one objective function, even a complicated one built out of multiple pieces combined nonlinearly. Namely, (1) transparency/interpretability [because a human can inspect the vector V], and (2) real-time control [because a human can change the scalarization function on the fly]. Incidentally, I think (2) is part of how brains work; an example of the real-time control is that if you’re hungry, entertaining a plan that involves eating gets extra points from the brainstem/hypothalamus (positive coefficient), whereas if you’re nauseous, it loses points (negative coefficient). That’s my model anyway, you can disagree :) As for transparency/interpretability, I’ve suggested that maybe the vector V should have thousands of entries, like one for every word in the dictionary … or even millions of entries, or infinity, I dunno, can’t have too much of a good thing. :-)
You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after transformation. When transformation is applied to Q values then the aggregation takes place quite late in the process—as Ben said, during action selection.
Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but they have different philosophical interpretations and also lead to different experimental outcomes in the agent’s behaviour. I think both approaches need more research.
For example, I would say that transforming the rewards instead of Q values is more risk-averse as well as “fair” towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent.
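A toy calculation in Python shows why the two placements differ. It assumes a deterministic episode, no discounting by default, and an illustrative concave transform (logarithmic for gains, exponential for losses); in a real agent the Q value is a learned expectation, so this is only a sketch of the arithmetic.

```python
import math


def transform(x):
    """Illustrative concave transform: log for gains, exponential for losses."""
    return math.log1p(x) if x > 0 else 1.0 - math.exp(-x)


def transform_then_sum(rewards, gamma=1.0):
    """Transform each timestep's reward, then accumulate the return."""
    return sum(gamma**t * transform(r) for t, r in enumerate(rewards))


def sum_then_transform(rewards, gamma=1.0):
    """Accumulate the raw return first (as a Q value would), then transform it."""
    return transform(sum(gamma**t * r for t, r in enumerate(rewards)))


# One objective's rewards over a two-step episode: a gain followed by an equal loss.
episode = [2.0, -2.0]

per_step = transform_then_sum(episode)
on_return = sum_then_transform(episode)
```

When the transform is applied to the accumulated return, the gain and the loss cancel before the transform sees them, so the episode is valued at zero. When it is applied per timestep, the loss is magnified on its own and the episode is valued at roughly -5.3, which matches the intuition that transforming rewards is the more risk-averse choice.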
Finally, there is a third approach which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work in this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead the vectors must be compared during RL action selection without aggregation. The downside is that it is unwieldy to have many objectives (or sets of objectives) lexicographically ordered.
I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be for example two main sets of objectives: one set for alignment objectives, the other set for performance objectives. Inside a set there would be nonlinear transformation and then aggregation applied, but between the sets there would be lexicographical ordering applied. In other words there would be a hierarchy of objectives. By having only two sets in lexicographical ordering the lexicographical ordering does not become unwieldy.
This approach would be somewhat analogous to the approach used by constraint programming, though more flexible. The safety objectives would act as a constraint against performance objectives. Constraints of this kind are, almost absurdly, missing from classical naive RL, even though they are essential, widely known, and technically well developed in practical applications, namely in constraint programming! In the hybrid approach proposed in the above paragraph, the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and the ability to trade (in a risk-averse way).
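The hybrid hierarchy described above could look roughly like the following Python sketch: a nonlinear transform and aggregation inside each set of objectives, and lexicographic ordering between the two sets. The transform, the action names, and the values are all hypothetical.

```python
import math


def transform(x):
    """Illustrative concave transform: log for gains, exponential for losses."""
    return math.log1p(x) if x > 0 else 1.0 - math.exp(-x)


def aggregate(values):
    """Within a set of objectives: transform each value, then sum."""
    return sum(transform(v) for v in values)


def hierarchy_key(action):
    """Between sets: alignment is compared first, performance only breaks ties."""
    return (aggregate(action["alignment"]), aggregate(action["performance"]))


# Hypothetical actions, each with two alignment and two performance objectives.
ACTIONS = {
    "a": {"alignment": (0.0, -0.1), "performance": (5.0, 5.0)},
    "b": {"alignment": (0.0, 0.0), "performance": (1.0, 0.0)},
    "c": {"alignment": (0.0, 0.0), "performance": (3.0, 2.0)},
}

chosen = max(ACTIONS, key=lambda name: hierarchy_key(ACTIONS[name]))
```

Action "a" has by far the best performance but a small alignment cost, so it loses to both alternatives; "b" and "c" tie on alignment and "c" wins on performance. A strict ordering like this lets performance matter only on exact alignment ties, so a practical version would probably need a threshold or tolerance on the alignment score.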
Finally, when we say “multi-objective” then it does not just refer to the technical details of the computation. It also stresses the importance of acknowledging the need for researching and making more explicit the inherent presence and even structure of multiple objectives inside any abstract top objective. To encode knowledge in a way that constrains incorrect solutions but not correct solutions. As well as acknowledging the potential existence of even more complex, nonlinear interactions between these multiple objectives. We did not focus on nonlinear interactions between the objectives yet, but these interactions are possibly relevant in the future.
I totally agree that in a reasonable agent the objectives or target values / set-points do change, as it is also exemplified by biological systems.
While the MODEM website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
That’s right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.
I guess whether one calls this “multi-objective RL” is semantic. Because objectives are combined during action selection, not during learning itself, I would not call it “single objective RL with a complicated objective”. If you combined objectives during reward, then I could call it that.
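To show what combining during action selection rather than during learning can mean, here is a minimal tabular sketch in Python. It keeps a separate learned value per objective and only scalarizes when choosing an action. The SARSA-style update, the state and action names, and the transform are assumptions for illustration, not a description of our actual implementation.

```python
import math
from collections import defaultdict

N_OBJECTIVES = 2
ACTIONS = ("left", "right")


def make_q_table():
    """Map (state, action) to a vector holding one value per objective."""
    return defaultdict(lambda: [0.0] * N_OBJECTIVES)


def update(q, state, action, reward_vector, next_state, next_action,
           alpha=0.5, gamma=0.9):
    """Learning step: each objective is updated independently, with no combining."""
    current = q[(state, action)]
    following = q[(next_state, next_action)]
    for i in range(N_OBJECTIVES):
        td_target = reward_vector[i] + gamma * following[i]
        current[i] += alpha * (td_target - current[i])


def scalarize(values):
    """Action-selection step: transform each objective's value, then sum."""
    return sum(math.log1p(v) if v > 0 else 1.0 - math.exp(-v) for v in values)


def greedy_action(q, state):
    """Choose the action whose value vector scores highest after scalarization."""
    return max(ACTIONS, key=lambda a: scalarize(q[(state, a)]))
```

A short usage example: after `q = make_q_table()`, one update for "left" with reward vector `(4.0, -2.0)` and one for "right" with `(1.0, 0.0)` leave the learned vectors at `[2.0, -1.0]` and `[0.5, 0.0]`. The raw sums favour "left", but `greedy_action(q, "s")` picks "right" because the scalarization penalizes the loss on the second objective.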
re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven’t thought about homeostatic processes in this project (my upcoming paper is all about them!). Definitely am not suggesting that our particular implementation of “MORL” (if we can call it that) is the only or even the best sort of MORL. I’m just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a sort of multi-objective decision-making system with no single combinatory mechanism at all except for the emergent winner of whatever kind of output happens in a particular context—that could plausibly be different depending on whether an action is moving limbs, talking, or mentally setting an intention for a long term plan.
Thanks for writing this up! I support your call for more alignment research that looks more deeply at the structure of the objective/reward function. In general I feel that the reward function part of the alignment problem/solution space could use much more attention, especially because I do not expect traditional ML research community to look there.
Traditional basic ML research tends to abstract away from the problem of writing an aligned reward function: it is all about investigating improvements to general-purpose machine learning, machine learning that can optimize for any possible ‘black box’ reward function R.
In the work you did, you show that this black box view of the reward function is too narrow. Once you open up the black box and treat the reward function as a vector, you can define additional criteria about how machine learning performance can be aligned or unaligned.
In general, I found that once you take the leap and start contemplating reward function design, certain problems of AI alignment can become much more tractable. To give an example: the management of self-modification incentives in agents becomes kind of trivial if you can add terms to the reward function which read out some physical sensors, see for example section 5 of my paper here.
So I have been somewhat puzzled by the question of why there is so little alignment research in this direction, or why so few people step up and point out that this kind of stuff is trivial. Maybe this is because improving the reward function is not considered to be a part of ML research. If I try to manage self-modification incentives with my hands tied behind my back, without being allowed to install physical sensors coupled to reward function terms, the whole problem becomes much less tractable. Not completely intractable, but the solutions I then find (see this earlier paper) are mathematically much more complex, and less robust under mistakes of machine learning.
I sometimes have the suspicion that there are whole non-ML conferences or bodies of literature devoted to alignment related reward function design, but I am just not seeing them. Unfortunately, it looks like the modem2021 workshop website with the papers you linked to is currently down. It was working two weeks ago.
So a general literature search related question: while doing your project, did you encounter any interesting conferences or papers that I should be reading, if I want to read more work on aligned reward function design? I have already read Human-aligned artificial intelligence is a multiobjective problem.
While the MODEM website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2
The only resource I’d recommend, beyond MODEM (when that’s back up) and our upcoming JAAMAS special issue, is to check out Elicit, Ought’s GPT-3-based AI lit search engine (yes, they’re teaching GPT-3 about how to create a superintelligent AI. hmm). It’s in beta, but if they waitlist you and don’t accept you, email me and I’ll suggest they add you. I wouldn’t say it’ll necessarily show you research you’re not aware of, but I found it very useful for getting into the AI Alignment literature for the first time myself.
https://elicit.org/