Well, my disorganized list sure wasn’t complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren’t invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.
Sure—that's easy enough. Just off the top of my head, here are five safety concerns that I think are important and that I don't think you included:
The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception. (A toy sketch of this point follows the list.)
It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks itself in a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Any prior you use to incentivize models to behave in a particular way doesn’t necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorithms isn’t to search for the fastest algorithm.
Even if a model is trained in a myopic way—or even if a model is in fact myopic in the sense that it only optimizes some single-step objective—such a model can still end up deceiving you, e.g. if it cooperates with other versions of itself.
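A minimal toy sketch of the first concern above (functions easier to verify than to satisfy), not taken from the thread; the names (`model`, `verify_trigger`, `adversarial_search`, `TARGET_DIGEST`) are invented for illustration. The toy model defects only on inputs whose SHA-256 digest matches a hard-coded target, so checking whether a given input triggers defection takes one hash call, while finding such an input by brute-force adversarial search is infeasible:

```python
# Toy illustration (an assumption-laden stand-in, not anyone's real system) of a
# behavior that is easy to verify but hard to satisfy: defection is keyed to a
# cryptographic condition, so random adversarial search essentially never triggers it.
import hashlib
import os
from typing import Optional

# The trigger string is known only to whoever planted it; the trainer sees only the digest.
TARGET_DIGEST = hashlib.sha256(b"secret trigger the trainer never guesses").hexdigest()

def model(x: bytes) -> str:
    """Behaves well on every input except preimages of TARGET_DIGEST."""
    if hashlib.sha256(x).hexdigest() == TARGET_DIGEST:
        return "defect"
    return "cooperate"

def verify_trigger(x: bytes) -> bool:
    """Cheap verification: does this particular input cause defection?"""
    return model(x) == "defect"

def adversarial_search(num_samples: int = 100_000) -> Optional[bytes]:
    """Brute-force 'red teaming': sample random inputs and look for defection."""
    for _ in range(num_samples):
        candidate = os.urandom(16)
        if verify_trigger(candidate):
            return candidate
    return None  # overwhelmingly likely outcome: the search finds nothing

if __name__ == "__main__":
    print(adversarial_search())                                         # None
    print(verify_trigger(b"secret trigger the trainer never guesses"))  # True
```

On this toy picture, adversarial training only ever observes "cooperate", so it cannot certify the absence of the defection branch, whereas a verifier who is handed the trigger confirms it in a single call.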
Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.
(I do think there’s a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)
I do think there's a noticeable extent to which I was trying to list difficulties more central than those
Probably people disagree about which things are more central, or as evhub put it:
Every time anybody writes up any overview of AI safety, they have to make tradeoffs [...] depending on what the author personally believes is most important/relevant to say
Now FWIW I thought evhub was overly dismissive of (4), in which you made an important meta-point:
EY: 4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit—it does not lift it [...]
evhub: This is just answering a particular bad plan.
But I would add a criticism of my own, that this "List of Lethalities" somehow just takes it for granted that AGI will try to kill us all without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:
an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (???)
All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'. (I guess that makes sense)
If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. (???)
Perhaps you didn't bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: "−5. unaligned superintelligences tend to try to kill everyone, here's why <link>. … −4. all the most obvious proposed solutions to (−5) don't work, here's why <link>".
(Note that these have a theme: you can’t wrangle general computation / optimization. That’s why I’m short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)
Eliezer:
If you find that (for reasons still left unexplained)
… selection of code for intentionality is coupled – over the long run, in mostly non-reverse-engineerable ways – to various/most of the physical/chemical properties
… of the molecular substrate through which discrete code is necessarily computed/expressed (via input and output channels of information/energy packet transmission),
then given that
… the properties of the solid-state substrate (e.g. silicon-based hardware) computing AGI’s code
… differ from the properties of the substrate of humans (carbon-based wetware),
a conclusion that follows is that
… the intentionality being selected for in AGI over the long run
… will diverge from the intentionality that was selected for in humans.
What do you mean by ‘intentionality’? Per SEP, “In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs.” So I read your comment as saying, a la Searle, ‘maybe AI can never think like a human because there’s something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.’
This seems transparently silly to me—I know of no reasonable argument for thinking carbon differs from silicon on this dimension—and also not relevant to AGI risk. You can protest “but AlphaGo doesn’t really understand Go!” until the cows come home, and it will still beat you at Go. You can protest “but you don’t really understand killer nanobots!” until the cows come home, and superintelligent Unfriendly AI will still build the nanobots and kill you with them.
By the same reasoning, Searle-style arguments aren't grounds for pessimism either. If Friendly AI lacks true intentionality or true consciousness or whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatever.
That’s not the argument. Give me a few days to write a response. There’s a minefield of possible misinterpretations here.
However, the argumentation does undermine the idea that designing for mechanistic (alignment) operations is going to work. I’ll try and explain why.
BTW, with ‘intentionality’, I meant something closer to everyday notions of ‘intentions one has’. Will more precisely define that meaning later.
I should have checked for diverging definitions from formal fields. Thanks for catching that.
If you happen to have time, this paper serves as useful background reading: https://royalsocietypublishing.org/doi/full/10.1098/rsif.2012.0869
Particularly note the shift from trivial self-replication (e.g. most computer viruses) to non-trivial self-replication (e.g. as through substrate-environment pathways to reproduction).
None of this is sufficient for you to guess what the argumentation is (you might be able to capture a bit of it, along with a lot of incorrect and often implicit assumptions we must dig into).
If you could call on some patience and openness to new ideas, I would really appreciate it! I am already bracing for the next misinterpretation (which is fine, if we can talk about that). I apologise that I cannot yet find a viable way to set out all the argumentation in one go, and that this will get a bit disorientating as we go through the arguments step by step.
Returning to this:
Key idea: Different basis of existence → different drives → different intentions → different outcomes.
@Rob, I wrote up a longer explanation here, which I prefer to discuss with you in private first. Will email you a copy in the next weeks.