*(edit: discussions in the comments section have led me to realize there have been several conversations on LessWrong related to this topic that I did not mention in my original question post. *

*Since ensuring their visibility is important, I am listing them here: Rohin Shah **has explained how consequentialist agents optimizing for universe-histories rather than world-states can display any external behavior whatsoever**, Steven Byrnes **has explored corrigibility in the framework of consequentialism by arguing powerful agents will optimize for future world-states at least to some extent**, Said Achmiz has explained what incomplete preferences look like (**1**, **2**, **3**), EJT **has formally defined preferential gaps and argued incomplete preferences can be an alignment strategy**, John Wentworth **has analyzed incomplete preferences through the lens of subagents** but **has then argued that incomplete preferences imply the existence of dominated strategies**, and Sami Petersen **has argued Wentworth was wrong by showing how incomplete preferences need not be vulnerable**.)*

In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer looked back and lamented:

In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to—they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they’d spent a lot of time being exposed to over and over and over again in lots of blog posts.

Maybe there’s no way to make somebody understand why corrigibility is “unnatural” except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell’s attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.

Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, “Oh, well, I’ll just build an agent that’s good at optimizing things but doesn’t use these explicit expected utilities that are the source of the problem!”

And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.

And I have tried to write that page once or twice (eg “coherent decisions imply consistent utilities”) but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they’d have to do because this is in fact a place where I have a particular talent.

Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level (“So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all”), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what *specific* actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the *general patterns* (such as Instrumental Convergence) of even an “alien mind” that’s sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.

When Eliezer says “they did not even do as many homework problems as I did,” I doubt he is referring to actual undergrad-style homework problems written nicely in LaTeX. Nevertheless, I would like to know whether there is *some* sort of publicly available repository of problem sets that illustrate the principles he is talking about. Meaning set-ups where you have an agent (of sorts) that is acting in a manner that’s either not utility-maximizing or even simply not consequentialist, followed by explanations of how you can exploit this agent. Given the centrality of consequentialism (and the associated money-pump and Dutch book-type arguments) to his thinking about advanced cognition and powerful AI, it would be nice to be able to verify whether working on these “homework problems” indeed results in the general takeaway Eliezer is trying to communicate.

I am particularly interested in this question in light of EJT’s thorough and thought-provoking post on how “There are no coherence theorems”. The upshot of that post can be summarized as saying that “there are *no* theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy” and that “nevertheless, many important and influential people in the AI safety community have *mistakenly and repeatedly* promoted the idea that there are such theorems.”

I was not a member of this site at the time EJT made his post, but given the large number of upvotes and comments on his post (123 and 116, respectively, at this time), it appears likely that it was rather popular and people here paid some attention to it. In light of that, I must confess to finding the general community reaction to his post rather baffling. Oliver Habryka wrote in response:

The post does actually seem wrong though.

I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but mostly, I feel like in order to argue that something is wrong with these arguments is that you have to argue more compellingly against completeness and possible alternative ways to establish dutch-book arguments.

However, the “details”, as far as I can tell, have *never* been written up. There was one other post on this topic, by Valdes, who noted that “I have searched for a result in the literature that would settle the question and so far I have found none” and explicitly called for the community’s participation, but constructive engagement was minimal. John Wentworth, for his part, wrote a nice short explanation of what coherence looks like in a toy setting involving cache corruption and a simple optimization problem; this was interesting but not quite on point to what EJT talked about. But that was it; I could not find any other posts (written after EJT’s) that were even tangentially connected to these ideas. Eliezer’s own response was dismissive and entirely inadequate, not really contending with any of the arguments in the original post:

Eliezer: The author doesn’t seem to realize that there’s a difference between representation theorems and coherence theorems.

Cool, I’ll complete it for you then.

Transitivity: Suppose you prefer A to B, B to C, and C to A. I’ll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You’d be richer if you didn’t do that.

Completeness: Any time you have no comparability between two goods, I’ll swap them in whatever direction is most useful for completing money-pump cycles. Since you’ve got no preference one way or the other, I don’t expect you’ll be objecting, right?

Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post’s thesis, “There are no coherence theorems”, is therefore falsified by presentation of a counterexample. Have a nice day!

In the limit, you take a rock, and say, “See, the complete class theorem doesn’t apply to it, because it doesn’t have any preferences ordered about anything!” What about your argument is any different from this—where is there a powerful, future-steering thing that isn’t viewable as Bayesian and also isn’t dominated?

As EJT explained in detail,

EJT: These arguments don’t work. [...] As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences. [...]
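To make the transitivity money pump and EJT’s proposed policy concrete, here is a deliberately minimal toy simulation of my own construction (the goods, the penny fee, and both agent rules are illustrative assumptions, not anything from the quoted posts). The cyclic agent happily pays to go around the A→B→C circle; an agent whose goods are all mutually incomparable, following EJT’s “never choose anything strictly dispreferred to something previously turned down” policy, has no strict preference to motivate paying for any swap, so it cannot be pumped:

```python
# Toy money-pump simulation (illustrative sketch only).
# An adversary repeatedly offers to swap the agent's current good for another,
# charging one penny per accepted swap.

def run(agent_accepts, offers, start="C"):
    """Return (final good held, pennies lost) after processing all swap offers."""
    held, pennies_lost, history = start, 0, []
    for offered in offers:
        if agent_accepts(held, offered, history):
            history.append(held)       # remember what we gave up
            held = offered
            pennies_lost += 1
    return held, pennies_lost

# Agent 1: cyclic (intransitive) strict preferences A > B, B > C, C > A.
CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}
def cyclic_agent(held, offered, history):
    return (offered, held) in CYCLIC   # accept iff the offered good is strictly preferred

# Agent 2: incomplete preferences (no good is comparable to any other),
# following EJT's policy: never accept an option strictly dispreferred to
# something previously turned down. With no strict preferences at all, no
# swap is ever worth a penny, so it never trades.
def ejt_agent(held, offered, history):
    return False

offers = ["B", "A", "C"] * 3   # adversary cycles the offers three times

print(run(cyclic_agent, offers))   # back to "C", nine pennies poorer
print(run(ejt_agent, offers))      # still "C", zero pennies lost
```

This only illustrates the degenerate case of total incomparability; EJT’s actual argument also covers agents that do sometimes trade, where the policy blocks the final dominated step of a pump rather than all trades.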

This whole situation appears very strange to me, as an outsider; isn’t this topic important enough to merit an analysis that gets us beyond saying (in Habryka’s words) “it does *seem* wrong” and toward “it’s *actually* wrong, here’s the math that proves it”? I tried quite hard to find one, and was not able to. Given that coherence arguments are still crucial argumentative building blocks of the case made by users here that AI risk should be taken seriously (and that the general format of these arguments has remained unchanged), it leaves me with the rather uncanny impression that EJT’s post was seen by the community, acknowledged as important, yet never truly engaged with, and essentially… forgotten, or maybe ignored? It doesn’t seem like it has changed anyone’s behavior or arguments, despite no refutation of it having appeared. Am I missing something important here?

This is going to be a somewhat-scattered summary of my own current understanding. My understanding of this question has evolved over time, and is therefore likely to continue evolving.

## Classic Theorems

First, there’s all the classic coherence theorems—think Complete Class or Savage or Dutch books or any of the other arguments you’d find in the Stanford Encyclopedia of Philosophy. The general pattern of these is:

Assume some arguably-intuitively-reasonable properties of an agent’s decisions (think e.g. lack of circular preferences).

Show that these imply that the agent’s decisions maximize some expected utility function.

I would group objections to this sort of theorem into three broad classes:

1. Argue that some of the arguably-intuitively-reasonable properties are not actually necessary for powerful agents.

2. Be confused about something, and accidentally argue against something which is either not really what the theorem says, or which assumes a particular way of applying the theorem that is not the only way of applying it.

2.a. Argue that all systems can be modeled as expected utility maximizers (i.e. just pick a utility function which is maximized by whatever the system in fact does), and that the theorems therefore don’t say anything useful.

For an old answer to (2.a), see the discussion under my mini-essay comment on Coherent Decisions Imply Consistent Utilities. (We’ll also talk about (2.a) some more below.) Other than that particularly common confusion, there’s a whole variety of other confusions; a few common types include:

Only pay attention to the VNM theorem, which is relatively incomplete as coherence theorems go.

Attempt to rely on some notion of preferences which is not revealed preference.

Lose track of which things the theorems say an agent has utility and/or uncertainty over, i.e. what the inputs to the utility and/or probability functions are.

## How To Talk About “Powerful Agents” Directly

While I think EJT’s arguments specifically are not quite right in a few ways, there is an importantly correct claim close to his: none of the classic coherence theorems say “powerful agent → EU maximizer (in a nontrivial sense)”. They instead say “<list of properties which are not obviously implied by powerful agency> → EU maximizer”. In order to even start to make a theorem of the form “powerful agent → EU maximizer (in a nontrivial sense)”, we’d first need a clean intuitively-correct mathematical operationalization of what “powerful agent” even means.

Currently, the best method I know of for making the connection between “powerful agency” and utility maximization is in Utility Maximization = Description Length Minimization. There, the notion of “powerful agency” is tied to optimization, in the sense of pushing the world into a relatively small number of states. That, in turn, is equivalent (the post argues) to expected utility maximization. That said, that approach doesn’t explicitly talk about “an agent” at all; I see it less as a coherence theorem and more as a likely-useful piece of some future coherence theorem.

What would the rest of such a future coherence theorem look like? Here’s my current best guess:

We start from the idea of an agent optimizing stuff “far away” in spacetime. Coherence of Caches and Agents hints at why this is necessary: standard coherence constraints are only substantive when the utility/”reward” is not given for the immediate effects of local actions, but rather for some long-term outcome. Intuitively, coherence is inherently substantive for long-range optimizers, not myopic agents.

We invoke the Utility Maximization = Description Length Minimization equivalence to say that optimization of the far-away parts of the world will be equivalent to maximization of some utility function over the far-away parts of the world.

We then use basically similar arguments to Coherence of Caches and Agents, but generalized to operate on spacetime (rather than just states-over-time with no spatial structure) and allow for uncertainty.
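The second step leans on the equivalence between utility maximization and description length minimization. A compressed gloss of that equivalence, in my own notation (a sketch of the idea, not the post’s exact formalism):

```latex
Let $X$ be the far-away world state and $M$ a model assigning probability
$P_M(X)$ to the states the optimizer pushes toward, so the description length
of $X$ under $M$ is $L_M(X) = -\log_2 P_M(X)$. Then for policies $\pi$,
\[
  \operatorname*{arg\,min}_{\pi} \; \mathbb{E}_{\pi}\!\left[-\log_2 P_M(X)\right]
  \;=\;
  \operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}\!\left[u(X)\right],
  \qquad u(X) := \log_2 P_M(X),
\]
so ``push the world into the small set of states favored by $M$'' and
``maximize expected utility $u$'' select the same policies. Conversely, over a
finite state space, any utility function bounded above can be exponentiated
and normalized into such a model, $P_M(X) \propto 2^{u(X)}$.
```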

## Pareto-Optimality/Dominated Strategies

There are various claims along the lines of “agent behaves like <X>, or else it’s executing a pareto-suboptimal/dominated strategy”.

Some of these are very easy to prove; here’s my favorite example. Suppose an agent has a fixed utility function and performs pareto-optimally on that utility function across multiple worlds (so “utility in each world” is the set of objectives). Then there’s a normal vector (or family of normal vectors) to the pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector’s components will all be nonnegative (because pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is “optimizing relatively harder” for utility in those worlds. This says nothing at all about how the agent will update, and we’d need another couple sentences to argue that the agent maximizes *expected* utility under the distribution, but it does give the prototypical mental picture behind the “pareto-optimal → probabilities” idea.

The most fundamental and general problem with pareto-optimality-based claims is that “pareto-suboptimal” implies that we already had a set of quantitative objectives in mind (or in some cases a “measuring stick of utility”, like e.g. money). But then some people will say “ok, but what if a powerful agent just isn’t pareto-optimal with respect to any resources at all, for instance because it just produces craptons of resources and then uses them inefficiently?”.
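The normal-vector picture above can be checked with a small numerical example (my own toy numbers): pick a nonnegative normal direction, find the frontier strategy it supports, and normalize the normal vector into a distribution over worlds.

```python
# Numerical illustration of "pareto-optimal -> probabilities" (toy example).
import numpy as np

# Utility achieved in (world 1, world 2) for each available strategy.
strategies = np.array([
    [10.0, 0.0],
    [8.0, 6.0],
    [5.0, 9.0],
    [0.0, 10.0],
    [4.0, 4.0],   # pareto-dominated by [8, 6]
])

# A pareto-optimal agent ends up at a frontier point: the point maximizing a
# weighted sum of the objectives, where the weight vector is (proportional to)
# the normal of the supporting hyperplane at that point.
normal = np.array([2.0, 1.0])            # some nonnegative normal direction
achieved = strategies[np.argmax(strategies @ normal)]
print(achieved)                           # [8. 6.]

# Normalize the normal vector: it behaves like a probability distribution,
# with more weight on the world the agent "optimizes harder" for.
p = normal / normal.sum()
print(p)                                  # [0.66666667 0.33333333]
```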

(Aside: “‘pareto-suboptimal’ implies we already had a set of quantitative objectives in mind” is also usually the answer to claims that all systems can be represented as expected utility maximizers. Sure, any system can be represented as an expected utility maximizer which is pareto-optimal with respect to some made-up objectives/resources which we picked specifically for this system. That does not mean all systems are pareto-optimal with respect to money, or energy, or other resources which we actually care about. Or, if using Utility Maximization = Description Length Minimization to ground out the quantitative objectives: not all systems are pareto-optimal with respect to optimization of some stuff far away in the world. That’s where the nontrivial content of most coherence theorems comes from: the quantitative objectives with respect to which the agent is pareto-optimal need to be things we care about for some reason.)

## Approximate Coherence

What if a powerful agent just isn’t pareto-optimal with respect to any resources or far-away optimization targets at all? Or: even if you do expect powerful agents to be approximately pareto-optimal, presumably they will be *approximately* pareto-optimal, not *exactly* pareto-optimal. What can we say about coherence then?

To date, I know of no theorems saying anything at all about approximate coherence. That said, this looks like more a case of “nobody’s done the legwork yet” rather than “people tried and failed”. It’s on my todo list.

My guess is that there’s a way to come at the problem with a thermodynamics-esque flavor, which would yield *global* bounds, for instance of roughly the form “in order for the system to apply n bits of optimization more than it could achieve with outputs independent of its inputs, it must observe at least m bits and approximate coherence to within m-n bits” (though to be clear I don’t yet know the right ways to operationalize all the parts of that sentence). The simplest version of a theorem of that form doesn’t work, but David and I have played with some variations and have some promising threads.

I remember reading the EJT post and left some comments there. The basic conclusions I arrived at are:

The transitivity property *is* actually important and necessary; one can construct money-pump-like situations if it isn’t satisfied. See this comment.

If we keep transitivity, but not completeness, and follow a strategy of not making choices inconsistent with our previous choices, as EJT suggests, then we no longer have a single consistent utility function. However, it looks like the behaviour can still be roughly described as “picking a utility function at random, and then acting according to *that* utility function”. See this comment.

In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility, i.e. the probability of taking an action a is proportional to exp(βE[U|a]). By tuning β we can affect whether the agent cares more about entropy or utility. This closely resembles RLHF-finetuned language models: they’re trained both to achieve a high rating and to not have too great a relative entropy with respect to the prior implied by pretraining.
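A minimal sketch of that toy agent (my own implementation; the utilities and β values are arbitrary illustrations):

```python
# Entropy-regularized agent: sample action a with probability
# proportional to exp(beta * E[U | a]).
import numpy as np

def policy(expected_utilities, beta):
    """Softmax over expected utilities; beta trades off utility against entropy."""
    logits = beta * np.asarray(expected_utilities, dtype=float)
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()                   # normalize into a distribution

eu = [1.0, 2.0, 3.0]

print(policy(eu, beta=0.0))   # uniform: pure entropy maximization
print(policy(eu, beta=5.0))   # concentrates on the highest-utility action
```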

Note that if the distribution of utility under the prior is heavy-tailed, you can get infinite utility even with arbitrarily low relative entropy, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may be unsafe or get no better utility than the prior.
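The heavy-tail point can be illustrated numerically. In this sketch (my own construction, not from any of the posts), outcome i has utility i and prior probability proportional to 1/i²; moving a shrinking mass ε = 1/√M onto a far-tail outcome around index M drives the utility gain up without bound while the KL divergence from the prior shrinks toward zero:

```python
# Heavy-tailed utilities under the prior: arbitrarily large utility gain
# at arbitrarily small relative entropy.
import numpy as np

N = 10**6
u = np.arange(1, N + 1, dtype=float)   # utility of outcome i is i
p = 1.0 / u**2
p /= p.sum()                           # heavy-tailed prior over outcomes

def kl(q, p):
    """KL(q || p) in nats; q and p are strictly positive here."""
    return float(np.sum(q * np.log(q / p)))

results = []
for M in [10**2, 10**4, 10**6 - 1]:
    eps = 1.0 / np.sqrt(M)
    q = (1 - eps) * p
    q[M] += eps                        # shift mass eps onto a huge-utility outcome
    results.append((kl(q, p), float(q @ u - p @ u)))
    print(f"M={M:>7}  KL={results[-1][0]:.3f} nats  utility gain={results[-1][1]:.1f}")
# As M grows, KL falls while the utility gain keeps climbing.
```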

I’m coming to this two weeks late, but here are my thoughts.

The question of interest is:

Will sufficiently-advanced artificial agents be representable as maximizing expected utility?

Rephrased:

Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?

Coherence arguments purport to establish that the answer is yes. These arguments go like this:

There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.

Sufficiently-advanced artificial agents will not pursue dominated strategies.

So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.

These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.

Now to address two prominent responses:

‘I define ‘coherence theorems’ differently.’

In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’

All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.

‘The mistake is benign.’

This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’. Rohin thinks this mistake is benign.

I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:

Completeness doesn’t come for free.

Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing.

Money-pump arguments for Transitivity assume Completeness.

Training agents to violate Completeness might keep them shutdownable.

Two important points

Here are two important points, which I make to preclude misreadings of the post:

Future artificial agents—trained in a standard way—might still be representable as maximizing expected utility.

Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents—trained in a standard way—will be representable as maximizing expected utility.

Artificial agents not representable as maximizing expected utility can still be dangerous.

So why does the post matter?

The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.

AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.

But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.

As I wrote elsewhere:

I find the money pump argument for completeness to be convincing.

The rule that you provide as a counterexample (the Caprice rule) is one that gradually completes the preferences of the agent as it encounters a variety of decisions. You appear to agree that this is the case. This isn’t a large problem for your argument. The big problem is that when there are lots of random nodes in the decision tree, such that the agent *might* encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated.

You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).

I take the ‘lots of random nodes’ possibility to be addressed by this point:

Can you explain why you think that doesn’t work?

To elaborate a little more, introducing random nodes allows for the possibility that the agent ends up with some outcome that they disprefer to the outcome that they would have gotten (as a matter of fact, unbeknownst to the agent) by making different choices. But that’s equally true of agents with complete preferences.

I intended for my link to point to the comment you linked to, oops.

I’ve responded here; I think it’s better to just keep one thread of argument, in a place where there is more necessary context.

I guess I just don’t see it as a weak point in the doom argument that goal-orientedness is a convergent attractor in the space of self-modifying intelligences?

It feels similar to pondering the familiar claim of evolution, that systems that copy themselves and seize resources are an attractor state. Sure it’s not 100% proven but it seems pretty solid.

This is kind of baffling to read, particularly in light of the statement by Eliezer that I quoted at the very beginning of my post.

If the argument is (and indeed it is) that “many superficially appealing solutions like corrigibility, moral uncertainty etc are in general contrary to the structure of things that are good at optimization,” and the way we see this is by doing homework exercises within an expected utility framework, and the reason why we must choose an EU framework (“certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples”) is *because* agents which don’t maximize expected utility are always exploitable, then it seems quite straightforward that *if it isn’t true that these agents are exploitable, then the entire argument collapses*.

Of course it doesn’t mean the conclusion is now wrong, but you need some other reason for reaching that conclusion than the typical money pumps and Dutch books that were being offered up as justifications.

This also requires a citation, or at the very least some reasoning; I’m not aware of any theorems that show goal-orientedness is a convergent attractor, but I’d be happy to learn more.

If the reason why you think this is true is because of intuitions about what powerful cognition must be like, but the *source* of those intuitions was the set of coherence arguments that are being discussed in this question post, then learning that the coherence arguments do not extend as far as they were purported to should cause you to rethink those intuitions and the conclusions you had previously reached on their basis, as they are now tainted by that confusion.

Sure, it seems solid, and it also seems plausible that formalizing this should be straightforward for an expert in the domain. I’m not sure why this is a good analogy to the topic of agentic behavior and cognition.

Ok here’s my reasoning:

When an agent is goal-oriented, they want to become more goal-oriented, and maximize the goal-orientedness of the universe with respect to their own goal. So if we diagram the evolution of the universe’s goal-orientedness, it has the shape of an attractor.

There are plenty of entry paths where some intelligence-improving process spits out a goal-oriented general intelligence (like biological evolution did), but no exit path where a universe whose smartest agent is super goal-oriented ever leads to that no longer being the case.

Because expected value tells us that the more resources you control, the more robustly you can maximize your probability of success in the face of whatever may come at you, and the higher your maximum possible utility is (if you have a utility function without an easy-to-hit max score).

“Maximizing goal-orientedness of the universe” was how I phrased the prediction that conquering resources involves having them aligned to your goal / aligned agents helping you control them.