Coherence arguments do not entail goal-directed behavior
One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments suggesting that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy: there must be some other strategy that never does worse than yours, but does strictly better in at least one possible situation. There’s a good explanation of these arguments here.
We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it—in this case, the argument says that we will not be able to notice any exploitable failures.)
Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these categories. This suggests that we could try building an AI system that can be modeled as an EU maximizer, yet doesn’t fall into one of these two categories, and so doesn’t have all of the problems that we worry about.
Note that there are two different flavors of arguments that the AI systems we build will be goal-directed agents (which are dangerous if the goal is even slightly wrong):
Simply knowing that an agent is intelligent lets us infer that it is goal-directed. (EDIT: See these comments for more details on this argument.)
Humans are particularly likely to build goal-directed agents.
I will only be arguing against the first claim in this post, and will talk about the second claim in the next post.
All behavior can be rationalized as EU maximization
Suppose we have access to the entire policy of an agent, that is, given any universe-history, we know what action the agent will take. Can we tell whether the agent is an EU maximizer?
Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy takes action a given history h, and 0 otherwise. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any other policy gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, e.g. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.)
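As a concrete sketch of this construction (the function and policy names here are illustrative, not from any library), any policy at all can be wrapped into a utility function that it perfectly maximizes:

```python
def make_rationalizing_utility(policy):
    """Given any policy (a function from history to action), return the
    indicator utility U(h, a) = 1 iff the policy takes a on history h."""
    def U(history, action):
        return 1 if policy(history) == action else 0
    return U

# Example: a "twitching robot" that just alternates between two actions.
def twitch_policy(history):
    return "left" if len(history) % 2 == 0 else "right"

U = make_rationalizing_utility(twitch_policy)

# At every step, the policy's own action gets utility 1 and every other
# action gets 0, so the policy exactly maximizes expected utility under U.
history = ()
for _ in range(5):
    best = max(["left", "right"], key=lambda a: U(history, a))
    assert best == twitch_policy(history)
    history += (best,)
```

This is why “the agent maximizes expected utility” is vacuous as a constraint on behavior: the construction works for any policy whatsoever.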
But wouldn’t this suggest that the VNM theorem has no content? Well, we assumed that we were looking at the policy of the agent, which led to a universe-history deterministically. We didn’t have access to any probabilities. Given a particular action, we knew exactly what the next state would be. Most of the axioms of the VNM theorem make reference to lotteries and probabilities—if the world is deterministic, then the axioms simply say that the agent must have transitive preferences over outcomes. Given that we can only observe the agent choose one history over another, we can trivially construct a transitive preference ordering by saying that the chosen history is higher in the preference ordering than the one that was not chosen. This is essentially the construction we gave above.
What then is the purpose of the VNM theorem? It tells you how to behave if you have probabilistic beliefs about the world, as well as a complete and consistent preference ordering over outcomes. This turns out to be not very interesting when “outcomes” refers to “universe-histories”. It can be more interesting when “outcomes” refers to world states instead (that is, snapshots of what the world looks like at a particular time), but utility functions over states/snapshots can’t capture everything we’re interested in, and there’s no reason to take as an assumption that an AI system will have a utility function over states/snapshots.
There are no coherence arguments that say you must have goal-directed behavior
Not all behavior can be thought of as goal-directed (primarily because I allowed the category to be defined by fuzzy intuitions rather than something more formal). Consider the following examples:
A robot that constantly twitches
The agent that always chooses the action that starts with the letter “A”
The agent that follows a policy <policy> in which the action for every history was generated randomly.
These are not goal-directed by my “definition”. However, they can all be modeled as expected utility maximizers, and there isn’t any particular way that you can exploit any of these agents. Indeed, it seems hard to model the twitching robot or the policy-following agent as having any preferences at all, so the notion of “exploiting” them doesn’t make much sense.
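To emphasize that these are perfectly well-defined behaviors, here is a minimal sketch of the three example agents as explicit policies (all names are illustrative):

```python
import random

def twitching_robot(history):
    # Ignores the world entirely and alternates motor commands.
    return "twitch_left" if len(history) % 2 == 0 else "twitch_right"

def alphabetical_agent(history, available_actions):
    # Always picks the alphabetically first action (e.g. one starting
    # with "A" when available), regardless of consequences.
    return min(available_actions)

def random_policy_agent(history, available_actions, seed=0):
    # A fixed lookup table of randomly generated actions, emulated here
    # by seeding a generator with the history so the same history always
    # yields the same (arbitrary) action.
    rng = random.Random(f"{seed}:{history}")
    return rng.choice(available_actions)
```

Each of these can be rationalized by the indicator utility function from the first section, and none of them can be money-pumped, yet calling any of them goal-directed would stretch the term past usefulness.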
You could argue that none of these agents are intelligent, and we’re only concerned with superintelligent AI systems. I don’t see why these agents could not in principle be intelligent: perhaps the agent knows how the world would evolve, and how to intervene on the world to achieve different outcomes, but it does not act on these beliefs. Perhaps if we peered into the inner workings of the agent, we could find some part of it that allows us to predict the future very accurately, but it turns out that these inner workings did not affect the chosen action at all. Such an agent is in principle possible, and it seems like it is intelligent.
(If not, it seems as though you are defining intelligence to also be goal-driven, in which case I would frame my next post as arguing that we may not want to build superintelligent AI, because there are other things we could build that are as useful without the corresponding risks.)
You could argue that while this is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally). I’ll talk about those in the next post; here I am simply arguing that merely knowing that an agent is intelligent, with no additional empirical facts about the world, does not let you infer that it has goals.
As a corollary, since all behavior can be modeled as maximizing expected utility, but not all behavior is goal-directed, it is not possible to conclude that an agent is goal-driven if you only know that it can be modeled as maximizing some expected utility. However, if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.
There are no coherence arguments that say you must have preferences
This section is another way to view the argument in the previous section, with “goal-directed behavior” now being operationalized as “preferences”; it is not saying anything new.
Above, I said that the VNM theorem assumes both that you use probabilities and that you have a preference ordering over outcomes. There are lots of good reasons to assume that a good reasoner will use probability theory. However, there’s not much reason to assume that there is a preference ordering over outcomes. The twitching robot, “A”-following agent, and random policy agent from the last section all seem like they don’t have preferences (in the English sense, not the math sense).
Perhaps you could define a preference ordering by saying “if I gave the agent lots of time to think, how would it choose between these two histories?” However, you could apply this definition to anything, including e.g. a thermostat or a rock. You might argue that a thermostat or rock can’t “choose” between two histories; but then it’s unclear how to define how an AI “chooses” between two histories without that definition also applying to thermostats and rocks.
Of course, you could always define a preference ordering based on the AI’s observed behavior, but then you’re back in the setting of the first section, where all observed behavior can be modeled as maximizing an expected utility function and so saying “the AI is an expected utility maximizer” is vacuous.
Convergent instrumental subgoals are about goal-directed behavior
One of the classic reasons to worry about expected utility maximizers is the presence of convergent instrumental subgoals, detailed in Omohundro’s paper The Basic AI Drives. The paper itself is clearly talking about goal-directed AI systems:
To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world.
It then argues (among other things) that such AI systems will want to “be rational” and so will distill their goals into utility functions, which they then maximize. And once they have utility functions, they will protect them from modification.
Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals. The coherence arguments all imply that AIs will be EU maximizers for some (possibly degenerate) utility function; they don’t prove that the AI must be goal-directed.
Goodhart’s Law is about goal-directed behavior
A common argument for worrying about AI risk is that we know that a superintelligent AI system will look to us like an EU maximizer, and if it maximizes a utility function that is even slightly wrong we could get catastrophic outcomes.
By now you probably know my first response: that any behavior can be modeled as an EU maximizer, and so this argument proves too much, suggesting that any behavior causes catastrophic outcomes. But let’s set that aside for now.
The second part of the claim comes from arguments like Value is Fragile and Goodhart’s Law. However, if we consider utility functions that assign value 1 to some histories and 0 to others, then if you accidentally assign a history where I needlessly stub my toe a 1 instead of a 0, that’s a slightly wrong utility function, but it isn’t going to lead to catastrophic outcomes.
The worry about utility functions that are slightly wrong holds water when the utility functions are wrong about some high-level concept, like whether humans care about their experiences reflecting reality. This is a very rarefied, particular distribution of utility functions, that are all going to lead to goal-directed or agentic behavior. As a result, I think that the argument is better stated as “if you have a slightly incorrect goal, you can get catastrophic outcomes”. And there aren’t any coherence arguments that say that agents must have goals.
Wireheading is about explicit reward maximization
There are a few papers that talk about the problems that arise with a very powerful system with a reward function or utility function, most notably wireheading. The argument that AIXI will seize control of its reward channel falls into this category. In these cases, typically the AI system is considering making a change to the system by which it evaluates goodness of actions, and the goodness of the change is evaluated by the system after the change. Daniel Dewey argues in Learning What to Value that if the change is evaluated by the system before the change, then these problems go away.
I think of these as problems with reward maximization, because typically when you phrase the problem as maximizing reward, you are maximizing the sum of rewards obtained in all timesteps, no matter how those rewards are obtained (i.e. even if you self-modify to make the reward maximal). It doesn’t seem like AI systems have to be built this way (though admittedly I do not know how to build AI systems that reliably avoid these problems).
In this post I’ve argued that many of the problems we typically associate with expected utility maximizers are actually problems with goal-directed agents or with explicit reward maximization. Coherence arguments only entail that a superintelligent AI system will look like an expected utility maximizer, but this is actually a vacuous constraint, and there are many potential utility functions for which the resulting AI system is neither goal-directed nor explicit-reward-maximizing. This suggests that we could try to build AI systems of this type, in order to sidestep the problems that we have identified so far.