Jeremy Gillen comments on Foom & Doom 2: Technical alignment is hard

Jeremy Gillen 25 Jun 2025 20:07 UTC
LW: 10 AF: 5
(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer’s arguments)
Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.
It feels like you’re rounding off Eliezer’s words in a way that removes the important subtlety. What you’re doing here is guessing at the upstream generator of Eliezer’s conclusions, right? As far as I can see in the links, he never actually says anything that translates to “I expect all ASI preferences to be over future outcomes”? It’s not clear to me that Eliezer would disagree with “impure consequentialism”.
I think you get closest to an argument that I believe with (2):
(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.
Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won’t necessarily build its successor to have the same non-consequentialist preference^[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building successors is a similar process to self-modification).
As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.
I think you’re misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for “loopholes” in that constraint, i.e. weird ways to achieve all of your goals that don’t involve violating the constraint. To you, they aren’t loopholes, they’re clever ways to achieve all goals.
1. ^
  Perhaps this feels intuitively incorrect. If so, I claim that’s because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don’t want to get rid of your own disgust reaction, but you’re okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.
- Steven Byrnes 27 Jun 2025 17:29 UTC
  LW: 9 AF: 6
  Thanks!
  Hmm, here’s a maybe-interesting example (copied from other comment):
  If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me.
  What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors. But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).
  (This is a toy example to illustrate a certain point, not a good AI motivation plan all-things-considered!)
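A minimal sketch of the contrast above, assuming a toy world where a trajectory is just the list of who holds power at each time step. The plan names, the five-step horizon, and both utility functions are hypothetical and only illustrate the difference between a preference about the end state and a preference about the whole trajectory.

```python
from typing import Callable, Dict, List

# A trajectory records who holds power at each time step (toy assumption).
Trajectory = List[str]


def final_state_utility(traj: Trajectory) -> float:
    """Cares only about who holds power at the very end."""
    return 1.0 if traj[-1] == "human" else 0.0


def continuous_utility(traj: Trajectory) -> float:
    """Cares that the human holds power at every step along the way."""
    return 1.0 if all(holder == "human" for holder in traj) else 0.0


# Two hypothetical plans an ASI might consider.
PLANS: Dict[str, Trajectory] = {
    "seize_then_hand_back": ["human", "ai", "ai", "ai", "human"],
    "stay_corrigible": ["human", "human", "human", "human", "human"],
}


def acceptable_plans(utility: Callable[[Trajectory], float]) -> List[str]:
    """Names of the plans that fully satisfy the given utility function."""
    return sorted(name for name, traj in PLANS.items() if utility(traj) == 1.0)


print(acceptable_plans(final_state_utility))
print(acceptable_plans(continuous_utility))
```

Under these assumptions the end-state preference does not rule out the power-seizing plan, while the continuous preference does. Both preferences are non-indexical, so neither depends on which agent carries the plan out.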
  Speaking of which, is it possible to get stability w.r.t. successors and self-modification while retaining indexicality? Maybe. I think things like “I want to be virtuous” or “I want to be a good friend” are indexical, but I think we humans kinda have an intuitive notion of “responsibility” that carries through to successors and self-modification. If I build a robot to murder you, then I didn’t pull the trigger, but I was still being a bad friend. Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno. (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?) I dunno, I appreciate the brainstorming.
  - Jeremy Gillen 4 Jul 2025 7:54 UTC
    LW: 6 AF: 3
    “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors.
    I agree that goals like this work well with self-modification and successors. I’d be surprised if Eliezer didn’t. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It’s strawmanning. And it isn’t supported by any of the links you cite. I think you must have some mistaken assumption about Eliezer’s views that is leading you to infer that he believes AIs must only have preferences over the distant future. But I can’t tell what it is. One guess is: to you, corrigibility only looks hard/unnatural if preferences are very strictly about the far future, and otherwise looks fairly easy.
    But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).
    I would still call those preferences consequentialist, since the consequences are the primary factor that determines the actions. I.e. the behaviour is complicated, but in a way that is easy to explain once you know what the behaviour is aimed at achieving. They’re even approximately long-term consequentialist, since the actions are (probably?) mostly aimed at the long-term future. The strict definition you call “pure consequentialism” is a good approximation or simplification of this, under some circumstances, like when value adds up over time and therefore the future is a bigger priority than the immediate present.
    No one I know has argued that AI or rational people can only care about the distant future. People spend money to visit a theme park sometimes, in spite of money being instrumentally convergent.
    Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno.
    Some versions of that do have loopholes, but overall I think I agree that you could get a lot of stability that way. (But as far as I can tell, the versions with fewer loopholes look more like consequence-based goals than like rules that say which kinds of local action-sequences are good and bad.)
    (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?)
    Yeah this is exactly what I had an issue with in my sibling discussion with Ryan. He seems to think {integrity,honesty,loyalty} are deontological, whereas the way they are implemented in me is as a mix of consequentialist reasoning (e.g. some components are “does this person end up better off, by their own lights?”, “do they understand what I’m doing and why?”) and a bunch of soft rules designed to reduce the chances that I accidentally rationalise actions that are ultimately hurtful for complicated reasons that are difficult to see in the moment (e.g. “in the course of my plan, don’t cross privacy boundaries that likely lead me to gain information that they might not have felt comfortable with me knowing”). But the rules aren’t a primary driver of action, they are relatively weak constraints that quickly rule out bad plans (that almost always would have been bad for consequentialist reasons).
    For me, it’s similar when I want to be a good friend.
    - Steven Byrnes 8 Jul 2025 16:22 UTC
      LW: 6 AF: 3
      My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It’s strawmanning. And it isn’t supported by any of the links you cite.
      For the record, my OP says something weaker than that—I wrote “Eliezer and some others…seem to expect ASIs to behave like a pure consequentialist, at least as a strong default…”.
      Maybe this is a pointless rabbit hole, but I’ll try one more time to argue that Eliezer seems to have this expectation, whether implicitly or explicitly, and whether justified or not:
      For example, look at Eliezer’s Coherent decisions imply consistent utilities, and then reflect on the fact that knowing that an agent is “coherent”, a.k.a. a “utility maximizer”, tells you nothing at all about its behavior, unless you make additional assumptions about the domain of its utility function (e.g. that the domain is ‘the future state of the world’). To me it seems clear that
      Either Eliezer is making those “additional assumptions” without mentioning them in his post, which supports my claim that pure-consequentialism is (to him) a strong default;
      Or his post is full of errors, because for example he discusses whether an AI will be “visibly to us humans shooting itself in the foot”, when in fact it’s fundamentally impossible for an external observer to know whether an agent is being incoherent / self-defeating or not, because (again) coherent utility-maximizing behaviors include absolutely every possible sequence of actions.
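A small sketch of the vacuity point, under the assumption that the utility function's domain is the entire history of actions. The action names and the helper functions are hypothetical. For any observed history, including one that looks like being money-pumped, there is a utility function over histories that the observed history uniquely maximizes.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

History = Tuple[str, ...]


def rationalizing_utility(observed: History) -> Callable[[History], float]:
    """Build a utility function over whole histories that `observed` uniquely maximizes."""
    def utility(history: History) -> float:
        return 1.0 if history == observed else 0.0
    return utility


def best_histories(
    utility: Callable[[History], float], actions: Sequence[str], length: int
) -> List[History]:
    """Enumerate every history of the given length and return the utility maximizers."""
    candidates = list(product(actions, repeat=length))
    top = max(utility(h) for h in candidates)
    return [h for h in candidates if utility(h) == top]


ACTIONS = ("trade_apple_for_orange", "trade_orange_for_apple", "pay_a_cent")

# A history that an outside observer might read as a money pump.
observed: History = (
    "trade_apple_for_orange",
    "pay_a_cent",
    "trade_orange_for_apple",
    "pay_a_cent",
)

u = rationalizing_utility(observed)
print(best_histories(u, ACTIONS, len(observed)) == [observed])
```

The construction is deliberately trivial. It only shows that "maximizes some utility function" places no constraint on behaviour until the domain of the function is restricted, for example to future world states.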
      - Jeremy Gillen 9 Jul 2025 11:36 UTC
        LW: 5 AF: 3
        Sorry if I misrepresented you, my intended meaning matches what you wrote. I was trying to replace “pure consequentialist” with its definition to make it obvious that it’s a ridiculously strong expectation that you’re saying Eliezer and others have.
        Yes, assumptions about the domain of the utility function are needed in order to judge its behaviour as coherent or not. Rereading Coherent decisions imply consistent utilities, Eliezer is usually clear about the assumed domain of the utility function in each thought experiment. For example, he’s very clear here that you need the preferences as an assumption:
        Have we proven by pure logic that all apples have the same utility? Of course not; you can prefer some particular apples to other particular apples. But when you’re done saying which things you qualitatively prefer to which other things, if you go around making tradeoffs in a way that can be viewed as not qualitatively leaving behind some things you said you wanted, we can view you as assigning coherent quantitative utilities to everything you want.
        And that’s one coherence theorem, among others, that can be seen as motivating the concept of utility in decision theory.
        In the hospital thought experiment, he specifies the goal as an assumption:
        Robert only cares about maximizing the total number of lives saved. Furthermore, we suppose for now that Robert cares about every human life equally.
        In the pizza example, he doesn’t specify the domain, but it’s fairly obvious implicitly. In the fruit example, it’s also implicit but obvious.
        There are a few paragraphs at the end of the Allais paradox section about the (very non-consequentialist) goal of feeling certain during the decision-making process. I don’t get the impression from those paragraphs that Eliezer is saying that this preference is ruled out by any implicit assumption. In fact he explicitly says that this preference isn’t mathematically improper. It seems he’s saying this kind of preference cuts against coherence only if it’s getting in the way of more valuable decisions:
        ‘The danger of saying, “Oh, well, I attach a lot of utility to that comfortable feeling of certainty, so my choices are coherent after all” is not that it’s mathematically improper to value the emotions we feel while we’re deciding. Rather, by saying that the most valuable stakes are the emotions you feel during the minute you make the decision, what you’re saying is, “I get a huge amount of value by making decisions however humans instinctively make their decisions, and that’s much more important than the thing I’m making a decision about.” This could well be true for something like buying a stuffed animal. If millions of dollars or human lives are at stake, maybe not so much.’
        I think this quote in particular invalidates your statements.
        There is a whole stack of assumptions^[1] that Eliezer isn’t explicit about in that post. It’s intended to give a taste of the reasoning that gives us probability and expected utility, not the precise weakest set of assumptions required to make a coherence argument work.
        I think one thing that is missing from that post is an account of why we usually do have prior knowledge of goals (among humans and for predicting advanced AI). Among humans we have good priors that heavily restrict the goal-space, plus introspection and stated preferences as additional data. For advanced AI, we can usually use usefulness (on some specified set of tasks) and generality (across a very wide range of potential obstacles) to narrow down the goal-domain. Only after this point, and with a couple of other assumptions, do we apply coherence arguments to show that it’s okay to use EUM and probability.
        The reason I think this is worth talking about is that I was actively confused about exactly this topic in the year or two before I joined Vivek’s team. Re-reading the coherence and advanced agency cluster of Arbital posts (and a couple of comments from Nate) made me realise I had misinterpreted them. I must have thought they were intended to prove more than they do about AI risk. And this update flowed on to a few other things. Maybe partially because the next time I read Eliezer as saying something that seemed unreasonably strong I tried to steelman it and found a nearby reasonable meaning. And also because I had a clearer idea of the space of agents that are “allowed”, and this was useful for interpreting other arguments.
        I’d be happy to call if that’s a more convenient way to talk, although it is nice to do this publicly. Also completely happy to stop talking about this if you aren’t interested, since I think your object-level beliefs about this ~match mine (“impure consequentialism” is expected of advanced AI).
        ^
        E.g. I think we need a bunch of extra structure about self-modification to apply anything like a money pump argument to resolute/updateless agents. I think we need some non-trivial arguments and an assumption to make the VNM continuity money pump work. I remember there being some assumption that went into complete class that I thought was non-obvious, but I’ve forgotten exactly what it was. The post is very clear that it’s just giving a few tastes of the kind of reasoning needed to pin down utility and probability as a reasonable model of advanced agents.
        Steven Byrnes 9 Jul 2025 17:03 UTC
        LW: 7 AF: 3
        Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true. I remain unconvinced but added a caveat to the article just to be safe:
        Why do Eliezer and others expect pure consequentialism? [UPDATE: …Or if I’m misreading Eliezer, as one commenter claims I am, replace that by: “Why might someone expect pure consequentialism?”]
        Towards_Keeperhood 19 Sep 2025 10:02 UTC
        LW: 1 AF: 1
        I want to note that it seems to me that Jeremy is trying to argue you out of the same mistake I tried to argue you out in this thread.
        The problem is that you use “consequentialism” differently from how Eliezer means it. I suppose he only used the word on a couple of occasions where he tried to get across the basic underlying model without going into excessive detail, and it may read to you like his usage there matches your “far future outcome pumping” definition (though back when I looked over your cited support that Eliezer means it, it didn’t seem at all like the evidence points to this interpretation). But if you get a deep understanding of logical decision theory, or you study a lot of MIRI papers (where the utility of agents is iirc always over trajectories of the environment program^[1]), you see what Eliezer’s deeper position is.
        Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true.
        I think you’re strawmanning Eliezer and propagating a wrong understanding of what “consequentialism” was supposed to refer to, and this seems like an important argument to have separately from what’s true. But it’s a good point that we should distinguish arguing about this from arguing about what’s true.
        Going forward, I suggest you use another word like “farfuturepumping” instead of “consequentialism”. (I’ll also use another word for Eliezer::consequentialism and clarify it, since it’s apparently often misunderstood.)
        As a quick summary, which may not be easily understandable due to inferential distance, I think that Eliezer and I both think that:
        1. Smart AIs will be utility optimizing, but this utility is over computations/universe-trajectories, not future states.
           a. This is a claim about how AI cognition will look like, not just a claim that its behavior will be coherent according to some utility function. Smart AIs will think in some utility-maximizing ways, even though it initially may be quite a mess where it’s really hard to read off what values are being optimized for, and the values may change a bit as the AI changes.
              i. Coherence arguments only imply that a coherent agent will behave as if it optimized a utility function; they imply nothing about what cognitive algorithm the agent uses. There’s an extra step needed to get to cognitive utility maximization, and AFAIK it hasn’t been explained well anywhere, but maybe it’s sorta intuitive?
        2. It’s perfectly alright to have non-farfuturepumping preferences like you describe, but just saying it’s possible isn’t enough, you actually need to write down the utility function over universe-trajectories.
           a. This is because if you just say “well it’s possible, so there”, you may fail to think concretely enough to see how a utility function that has the properties you imagine would actually be quite complex, and thus unlikely to be learned.
           b. Why can’t you have a utility function but also other preferences?
              i. There needs to be some tradeoff between the utility function and the other preferences, and however you choose this the result can be formalized as a utility function. If you don’t do this you can engage in abstract wishful thinking where you can imagine a different tradeoff for different cases and thereby delude yourself about your proposal robustly working.
           c. Why can’t you just specify that in some cases utility function u1 should be used, and in others u2 should be used?
              i. Because when u1 is used then there’s an instrumental incentive to modify the code of the AI such that u1 is always used. You want reflective consistency to avoid such problems.
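A toy sketch of the incentive described in the u1/u2 point, assuming an agent that is currently evaluating plans with u1 and can either leave its mode-switching code intact or rewrite it so that u1 always governs. The two outcomes, the utility tables, and the function names are hypothetical illustrations, not a model of any real system.

```python
from typing import Dict

OUTCOMES = ("paperclip", "staple")

# Two utility functions that disagree about which outcome is better (toy assumption).
u1: Dict[str, float] = {"paperclip": 1.0, "staple": 0.0}
u2: Dict[str, float] = {"paperclip": 0.0, "staple": 1.0}


def choose(utility: Dict[str, float]) -> str:
    """Pick the outcome that the given utility function rates highest."""
    return max(OUTCOMES, key=utility.get)


def future_outcome(self_modified: bool) -> str:
    """Outcome the later self picks in the context where u2 was meant to apply."""
    governing = u1 if self_modified else u2
    return choose(governing)


def value_now(self_modified: bool) -> float:
    """How the current self, which is using u1, scores that future outcome."""
    return u1[future_outcome(self_modified)]


print("keep switching code:", value_now(False))
print("rewrite to always use u1:", value_now(True))
```

Judged by u1, rewriting the code scores strictly higher in this setup, which is the reflective inconsistency the point warns about. A single utility function that already encodes the intended tradeoff has no such gap between what the current self and the later self want.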
        I would recommend chatting with Jeremy (and maybe rereading our comment thread).
        ^
        Yes utility is often formalized over the space of outcomes, but the space of outcomes is iirc the space of trajectories.
        Steven Byrnes 19 Sep 2025 13:40 UTC
        LW: 7 AF: 6
        My read of this conversation is that we’re basically on the same page about what’s true but disagree about whether Eliezer is also on that same page. Again, I don’t care. I already deleted the claim about what Eliezer thinks on this topic, and have been careful not to repeat it elsewhere.
        Since we’re talking about it, my strong guess is that Eliezer would ace any question about utility functions, what their domain is, and when “utility-maximizing behavior” is vacuous, … if asked directly.
        But it’s perfectly possible to “know” something when asked directly, but also to fail to fully grok the consequences of that thing and incorporate it into some other part of one’s worldview. God knows I’m guilty of that, many many times over!
        Thus my low-confidence guess is that Eliezer is guilty of that too, in that the observation “utility-maximizing behavior per se is vacuous” (which I strongly expect he would agree with if asked directly) has not been fully reconciled with his larger thinking on the nature of the AI x-risk problem.
        (I would further add that, if Eliezer has fully & deeply incorporated “utility-maximizing behavior per se is vacuous” into every other aspect of his thinking, then he is bad at communicating that fact to others, in the sense that a number of his devoted readers wound up with the wrong impression on this point.)
        Anyway, I feel like your comment is some mix of “You’re unfairly maligning Eliezer” (again, whatever, I have stopped making those claims) and “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
        Most of your comment is stuff I already agree with (except that I would use the term “desires” in most places that you wrote “utility function”, i.e. where we’re talking about “how AI cognition will look like”).
        I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading. I don’t endorse coining new terms when an existing term is already spot-on.
        Towards_Keeperhood 19 Sep 2025 19:00 UTC
        1 point
        To me it seems a bit surprising that you say we agree on the object level, when in my view you’re totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
        I also think the utility maximizer frame is useful, though there are 2 (IMO justified) assumptions that I see as going along with it:
        1. There’s something like a simplicity prior over the space of utility functions (because there needs to be some utility maximizing structure implemented in the AI).
        2. The utility function is a function of the trajectory of the environment. (Or in an even better formalization it may take as input a program which is the environment.)
        I think using a learned value function (LVF) that computes valence of thoughts is a worse frame to use for tackling corrigibility, because it’s harder to clearly evaluate what actions the agent will end up taking. Also, this kind of “imagine some plan and what the outcome would be and let the LVF evaluate that” doesn’t seem to me how smarter-than-human minds operate; considering what change in the world an action would cause seems more natural than whether some imagined scene seems appealing. Even humans like me move away from the LVF frame, e.g. I’m trying to correct for scope insensitivity of my LVF by doing something more like explicit expected utility calculations.^[1]
        “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
        I’m more like “Your abstract gesturing didn’t let me see any concrete proposal that would make me more hopeful, and even if good proposals are in that direction it seems to me like most of the work would still be ahead instead of it being like ‘we can just do it sorta like that’ as you seem to present it. But maybe I’m wrong and maybe you have more intuitions and will find a good concrete proposal.”
        I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading.
        Maybe study logical decision theory? Not sure where to best start but maybe here:
        “Logical decision theories” are algorithms for making choices which embody some variant of “Decide as though you determine the logical output of your decision algorithm.”
        Like consequentialism in the sense of “what’s the consequence of choosing the logical output of your decision algorithm in a particular way”, where consequence here isn’t a time-based event but rather the way the universe looks conditional on the output of your decision algorithm.
        ^
        I’m not confident those are the only reasons why LVF seems worse here; I haven’t fully articulated my intuitions yet.
        Steven Byrnes 19 Sep 2025 19:48 UTC
        2 points
        Maybe study logical decision theory?
        Eliezer has always been quite clear that you should one-box for Newcomb’s problem because then you’ll wind up with more money. The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
        You have desires, and then decision theory tells you how to act so as to bring those desires about. The desires might be entirely about the state of the world in the future, or they might not be. Doesn’t matter. Regardless, whatever your desires are, you should use good decision theory to make decisions that will lead to your desires getting fulfilled.
        Thus, decision theory is unrelated to our conversation here. I expect that Eliezer would agree.
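The "wind up with more money" point can be made concrete with a short expected-payoff calculation. This is a sketch under stated assumptions: the payoffs of 1,000,000 and 1,000 are the conventional textbook values (the thread does not state them), the predictor accuracy is a free parameter, and the calculation simply conditions on the choice made. It is not an implementation of logical decision theory.

```python
def expected_payoff(
    one_box: bool,
    accuracy: float,
    big: float = 1_000_000.0,  # assumed contents of the opaque box when it is full
    small: float = 1_000.0,    # assumed contents of the transparent box
) -> float:
    """Expected money, given a predictor that is right with probability `accuracy`."""
    if one_box:
        # The opaque box is full only when the predictor correctly foresaw one-boxing.
        return accuracy * big
    # When two-boxing, the opaque box is full only if the predictor erred.
    return (1.0 - accuracy) * big + small


for acc in (0.5, 0.9, 0.99):
    print(acc, expected_payoff(True, acc), expected_payoff(False, acc))
```

With these assumed payoffs, one-boxing comes out ahead whenever the accuracy exceeds 0.5005. The preference being served is the same in either case, namely more money; only the method of choosing differs.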
        To me it seems a bit surprising that you say we agree on the object level, when in my view you’re totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
        Your 2.a is saying “Steve didn’t write down a concrete non-farfuturepumping utility function, and maybe if he tried he would get stuck”, and yeah I already agreed with that.
        Your 2.b is saying “Why can’t you have a utility function but also other preferences?”, but that’s a very strange question to me, because why wouldn’t you just roll those “other preferences” into the utility function as you describe the agent? Ditto with 2.c, why even bring that up? Why not just roll that into the agent’s utility function? Everything can always be rolled into the utility function. Utility functions don’t imply anything about behavior, and they don’t imply reflective consistency, etc., it’s all vacuous formalizing unless you put assumptions / constraints on the utility function.
        Towards_Keeperhood 19 Sep 2025 20:52 UTC
        1 point
        The purpose of studying LDT would be to realize that the type signature you currently imagine Steve::consequentialist preferences to have is different from the type signature that Eliezer would imagine.
        The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
        You can totally have preferences about the past that are still influenced by your decision (e.g. Parfit’s hitchhiker).
        Decisions don’t cause future states, they influence which worlds end up real vs counterfactual. Preferences aren’t over future states but over worlds—which worlds would you like to be more real?
        AFAIK Eliezer only used the word “consequentialism” in abstract descriptions of the general fact that you (usually) need some kind of search in order to find solutions to new problems. (Like I think just using a new word for what he used to call optimization.) Maybe he also used the outcome pump as an example, but if you asked him what consequentialist preferences look like in detail, I’d strongly bet he’d say something like preferences over worlds rather than preferences over states in the far future.
- ryan_greenblatt 25 Jun 2025 22:40 UTC
  LW: 7 AF: 6
  
  Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won’t necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors.
  
  [...]
  
  Perhaps this feels intuitively incorrect. If so, I claim that’s because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don’t want to get rid of your own disgust reaction, but you’re okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.
  
  Hmm, imagine we replace “disgust” with “integrity”. As in, imagine that I’m someone who is strongly into the terminal moral preference of being an honest and high integrity person. I also value loyalty and pointing out ways in which my intentions might differ from what someone wants. Then, someone hires me (as an AI let’s say) and tasks me with building a successor. They also instruct me: “Make sure the AI successor you build is high integrity and avoids disempowering humans. Also, generalize the notion of ‘integrity, loyalty, and disempowerment’ as needed to avoid these things breaking down under optimization pressure (and get your successors to do the same). And, let me know if you won’t actually do a good job following these instructions, e.g. because you aren’t actually that well aligned. Like, tell me if you wouldn’t actually try hard and please be seriously honest with me about this.”
  
  In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn’t robustly pursue the interests of the developer. That’s not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.
  
  Another way to put this is that the deontological constraints we want are like the human notions of integrity, loyalty, and honesty (and we would then instruct the AI that we want these constraints propagated forward). I think an actually high integrity person/AI doesn’t search for loopholes or want to search for loopholes. And the notion of “not actually loopholes” generalizes between different people and AIs, I’d claim. (Because notions like “the humans remained in control” and “the AIs stayed loyal” are actually relatively natural and can be generalized.)
  
  I’m not claiming you can necessarily instill these (robust and terminal) deontological preferences, but I am disputing that they are similar to non-reflectively-endorsed (potentially non-terminal) deontological constraints or urges like disgust. (I don’t think disgust is an example of a deontological constraint, it’s just an obviously unendorsed physical impulse!)
  - Jeremy Gillen 26 Jun 2025 9:55 UTC
    LW: 4 AF: 1
    In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn’t robustly pursue the interests of the developer. That’s not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.
    Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.
    I think an actually high integrity person/AI doesn’t search for loopholes or want to search for loopholes.
    If I try to condition on the assumptions that you’re using (which I think include a central part of the AI’s preferences having a true-but-maybe-approximate pointer toward the instruction-giver’s preferences, and also involve a desire to defer or at least flag relevant preference differences) then I agree that such an AI would not search for loopholes on the object-level.
    I’m not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a more relevant-to-your-models scenario? The straightforward point was that preference-like objects need to be robust to search. Your response reads as “imagine we have a bunch of higher-level-preferences and protective machinery that already are robust to optimisation, then on the object level these can reduce the need for robustness”. This is locally valid.
    I don’t think it’s relevant because we don’t know how to build those higher-level-preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and increased option-space.
    (I don’t think disgust is an example of a deontological constraint, it’s just an obviously unendorsed physical impulse!)
    Some people reflectively endorse their own disgust at picking up insects, and wouldn’t remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.
    deontological constraints we want are like the human notions of integrity, loyalty, and honesty
    Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these “deontological” constraints, there’s a lot of consequentialist machinery involved (but it’s mostly shorter-term and more local than normal consequentialist preferences).
    - ryan_greenblatt 26 Jun 2025 14:59 UTC
      LW: 2 AF: 2
      I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them. So, I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, some bastardized version of them such that it isn’t in the relevant attractor basin (where the AI makes these properties more like what the human wanted), or you don’t get these things at all (possibly because the AI was scheming for longer run preferences and so it faked these things).
      
      And I don’t buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but these aren’t the ones people wanted. Like, I agree preferences have to be robust to search, but this is sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.
      
      Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation again comes down to “you get scheming”, “your behavioural tests look bad, so you try again”, “your behavioural tests look fine, and you didn’t have scheming, so you probably basically got the properties you wanted if you were somewhat careful”.
      
      As in, I think we can at least test for the higher-level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)
      
      (I’m not sure if I’m communicating very clearly, but I think this is probably not worth the time to fully figure out.)
      
      Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation like my level of intelligence, and some are undetermined at the moment because I haven’t reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, you wouldn’t self-modify to remove it) that you wouldn’t pass on to a successor bizarre: what is your future self if not a successor?
      - Jeremy Gillen 26 Jun 2025 18:01 UTC
        2 points
        I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them.
        That’s trivial and irrelevant, though, if true-obedience is part of it, since that’s magic that gets you anything you can describe.
        if the way integrity is implemented is at all kinda similar to how humans implement it.
        How do humans implement integrity?
        Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation comes down to “you get scheming”, “your behavioural tests look bad, so you try again”, “your behavioural tests look fine, and you didn’t have scheming, so you probably basically got the properties you wanted if you were somewhat careful”.
        You’re just stating that you don’t expect any reflective instability as an agent learns and thinks over time? I’ve heard you say this kind of thing before, but haven’t heard an explanation. I’d love to hear your reasoning, particularly since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it’d take some effort and I’d want to know that you were going to engage with it.)