ChatGPT still thinks I am wrong so let’s think step by step. Bostrom says (i.e. leads the reader to understand through his gestalt speech, not that he literally says this in one passage) that, in the default case:
When you specify your final goal, it is wrong.
It is wrong because it is a discrete program representation of a nuanced concept like “happiness” that does not fully capture what we think happiness is.
Eventually you will have a world model with a correct understanding of happiness, because the AI is superintelligent.
This representation of happiness in the superintelligent world model “understands us” and would presumably produce better results if we could point at that understanding instead (a toy sketch of this contrast follows these steps).
The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.
In a way all I am saying is that when you specify the program that will train your superintelligent AI, in Bostrom 2014 the AI’s superintelligent understanding is not available before you train it.
The final goal representation is part of the program that you write before the AI exists.
If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you had a correct specification of happiness, it would not be wrong.
Therefore Bostrom does not expect us to do this, because then the default would not be that your specification is wrong. Bostrom expects by default that our specification is wrong.
If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
The default way an AI becomes incorrigible is by becoming more powerful than us.
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
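To make the contrast between a hand-specified goal and a learned representation concrete, here is a toy sketch in Python (everything in it, from the function names to the ToyWorldModel class, is a hypothetical illustration of the argument above, not anything Bostrom or JDP actually describes): the hand-written reward is the discrete program representation you can write before training, while the learned happiness direction only exists once the world model has been trained, which is why it is not available at the time the goal is specified.

```python
import numpy as np

# Toy sketch only: all names here are hypothetical illustrations, not any
# real system and not anything from Bostrom 2014.
def handwritten_happiness_reward(world_state: dict) -> float:
    # The goal we can actually write before training is a crude, discrete
    # stand-in for "happiness" (e.g. count smiles).
    return float(world_state.get("smile_count", 0))

class ToyWorldModel:
    """Stand-in for the world model that exists only after training."""
    def __init__(self, rng: np.random.Generator):
        # Pretend this direction was learned from data during training.
        self.happiness_direction = rng.normal(size=64)

    def embed(self, world_state: dict) -> np.ndarray:
        # Pretend embedding of the world state (toy: raw features).
        return np.asarray(world_state["features"], dtype=float)

def learned_happiness_score(model: ToyWorldModel, world_state: dict) -> float:
    # "Pointing at" the trained model's own concept of happiness.
    # The catch: this function cannot be written at goal-specification time,
    # because `model` does not exist yet.
    return float(model.embed(world_state) @ model.happiness_direction)

rng = np.random.default_rng(0)
state = {"smile_count": 3, "features": rng.normal(size=64)}
print(handwritten_happiness_reward(state))                 # available up front, crude
print(learned_happiness_score(ToyWorldModel(rng), state))  # only exists after training
```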
Maybe this argument is right, but the paragraph I am confused about does not mention the word corrigibility once. It just says (paraphrased) “AIs will in fact understand what we mean, which totally pwns Bostrom because he said the opposite, as you can see in this quote” and then fails to provide a quote that says that, at all.
Like, if you said “Contra Bostrom, AI will be corrigible, which you can see in this quote by Bostrom” then I would not be making this comment thread! I would have objections and could make arguments, and maybe I would bother to make them, but I would not have the sense that you just said a sentence that really just sounds fully logically contradictory on its own premises and then, when asked about it, kept importing context that is not referenced in the sentence at all.
So did you just accidentally make a typo and mean to say “Contra Bostrom 2014 AIs will in fact probably be corrigible: ‘The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.’”?
If that’s the paragraph you meant to write, and this is just a typo, then everything makes sense. If it isn’t, then I am sorry to say that not much that you’ve said helped me understand what you meant by that paragraph.
My understanding: JDP holds that when the training process chisels a wrong goal into an AI because we gave it a wrong training objective (e.g., “maximize smiles” while we want “maximize eudaimonia”), this event could be validly described as the AI “misunderstanding” us.
So when JDP says that “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”, and claims that this counters this Bostrom quote...
“The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.”
… what JDP means to refer to is the “its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal” part, not the “the AI may indeed understand that this is not what we meant” part. (Pretend the latter part doesn’t exist.)
Reasoning: The fact that the AI’s goal ended up at “maximize happiness” after being trained against the “maximize happiness” objective, instead of at whatever the programmers intended by the “maximize happiness” objective, implies that there was a moment earlier in training when the AI “misunderstood” that goal (in the sense of “misunderstand” described in my first paragraph).
JDP then holds that this won’t happen, contrary to that part of Bostrom’s statement: that training on “naïve” pointers to eudaimonia like “maximize smiles” and such will Just Work, that the SGD will point AIs at eudaimonia (or at corrigibility or whatever we meant).[1] Or, in JDP’s parlance, that the AI will “understand” what we meant by “maximize smiles” well before it’s superintelligent.
If you think that this use of “misunderstand” is wildly idiosyncratic, or that JDP picked a really bad Bostrom quote to make his point, I agree.
(Assuming I am also not misunderstanding everything, there sure is a lot of misunderstanding around.)
[1] Plus/minus some caveats and additional bells and whistles like e.g. early stopping, I believe.
I want to flag that thinking you have a representation that could be used in principle to do the right thing is not the same thing as believing it will “Just Work”. If you do a naive RL process on neural embeddings or LLM evaluators you will definitely get bad results. I do not believe in “alignment by default” and push back on such things frequently whenever they’re brought up. What has happened is that the problem has gone from “not clear how you would do this even in principle, basically literally impossible with current knowledge” to merely tricky.
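For concreteness, here is a minimal sketch of the kind of naive setup that sentence warns against (the evaluator and the mutation function are made-up placeholders, and the hill-climbing loop merely stands in for the optimization pressure of a real RL process; nothing here is anyone’s actual training code): the loop optimizes whatever the scorer rewards, so any gap between the scorer and the intended concept gets amplified rather than corrected.

```python
import random

# Toy sketch only: the "evaluator" and "policy update" below are made-up
# placeholders, not any real model or training code.
def evaluator_score(text: str) -> float:
    # Crude, gameable proxy for "the described people are happy".
    return text.lower().count("happy") + 0.1 * len(text)

def propose_edit(text: str, rng: random.Random) -> str:
    # Stand-in for a policy update: mutate the current best output.
    return text + rng.choice([" happy", " smiling", " !!!", " forever"])

def naive_optimization_loop(steps: int = 200, seed: int = 0) -> str:
    rng = random.Random(seed)
    best = "People live good lives."
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        # Naive optimization: keep whatever the evaluator scores higher.
        if evaluator_score(candidate) > evaluator_score(best):
            best = candidate
    return best  # drifts into proxy-pleasing filler, not anything we meant

print(naive_optimization_loop())
```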
not the “the AI may indeed understand that this is not what we meant” part. (Pretend the latter part doesn’t exist.)
Ok, but the latter part does exist! I can’t ignore it. Like, it’s a sentence that seems almost explicitly designed to clarify that Bostrom thinks the AI will understand what we mean. So clearly, Bostrom is not saying “the AI will not understand what we mean”. Maybe he is making some other error in the book about how when the AI understands the way it does, it has to be corrigible, or that “happiness” is a confused kind of model of what an AI might want to optimize, but clearly that sentence is an atrocious sentence for demonstrating that “Bostrom said that the AI will not understand what we mean”. Like, he literally said the opposite right there, in the quote!
(JDP, you’re welcome to chime in and demonstrate that your writing was actually perfectly clear and that I’m just also failing basic reading comprehension.)
So clearly, Bostrom is not saying “the AI will not understand what we mean”
Consider the AI at two different points in time, AI-when-embryo early in training and AI-when-superintelligence at the end.
The quote involves Bostrom (a) literally saying that AI-when-superintelligence will understand what we meant,[1] (b) making a statement which logically implies, as an antecedent, that “AI-when-embryo won’t understand what we meant”.[2] Therefore, you can logically infer from this quote that Bostrom believes that the statement “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent” is false.
JDP, in my understanding, assumes that the reader would do just that: automatically zero in on (b), infer the antecedent from it, and dismiss (a) as irrelevant context. I love it when blog posts have lil’ tricksy logic puzzles in them.
clearly that sentence is an atrocious sentence for demonstrating that “Bostrom said that the AI will not understand what we mean”
“However, [AI-when-superintelligence’s] final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal[, because AI-when-embryo “misunderstood” that code’s intent.]”
This is correct, though that particular chain of logic doesn’t actually imply the “before superintelligence” part, since there is a space between embryo and superintelligence where it could theoretically come to understand. I argue why I think Bostrom implicitly rejects this, or thinks it must be irrelevant, in the 13 steps above. But I think it’s important context that to me this doesn’t come out as 13 steps or a bunch of sys2 reasoning: I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn’t feel like a hard thing from the inside, so I wouldn’t expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine someone wouldn’t understand what I’m talking about until several people went “no, I don’t get it”; that’s how basic it feels from the inside here. I now understand that no, this actually isn’t obvious; the hostile tone above was frustration from not knowing that yet.
I see! Understandable, but yep, I think you misjudged the inferential distance there a fair bit.
Clearly! I’m a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
“—Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like “happiness” at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: “But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.””
Part of why I didn’t write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.
The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.
It depends on what you mean by “available”—we already had a representation of happiness in a human brain. And building a corrigible AI that builds a correct representation of happiness is not enough—like you said, we need to point at it.
If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you can use it.
If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
Yes, the key is “otherwise not able to be used”.
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
No, unless by “correctly understands” you mean “has an identifiable representation that humans can use to program another AI”—he may expect that we will have an intelligence that correctly understands concepts like happiness while not yet being superintelligent (just as we have humans, who are better at this than “maximize happiness”), but that we still won’t be able to use it.
This is in principle a thing that Nick Bostrom could have believed while writing Superintelligence, but the rest of the book kind of makes it incompatible with Occam’s Razor. It’s possible he meant the issues with translating concepts into discrete program representations as the central difficulty, and whether we would be able to make use of such a representation as a noncentral difficulty. (It’s Bostrom, he’s a pretty smart dude, this wouldn’t surprise me; it might even be in the text somewhere, but I’m not reading the whole thing again.) But even if that’s the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
It’s important to remember also that Bostrom’s primary hypothesis in Superintelligence is that AGI will be produced by recursive self-improvement, such that it’s genuinely not clear you will have a series of functional non-superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY “human level is a weird threshold to expect AI progress to stop at” thesis as the default.
But even if that’s the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
I’m not so sure. Like, first of all, you must mean something like “get before superintelligence” or “get into the goal slot”, because there is obviously a method to just get the representations—build a superintelligence with a random goal and it will have your representations. That difference was explicitly stated then, and it is often explicitly stated now—all that “AI will understand but not care”. The focus on the frameworks where it gets hard to translate from humans to programs is consistent with him trying to constrain the methods of generating representations to only the useful ones.
There is a reason why it is called “the value loading problem” and not “the value understanding problem”. “The value translation problem” would be somewhere in the middle: having an actual human utility program would certainly solve some of Bostrom’s problems.
I don’t know whether Bostrom actually thought about a non-superintelligent AI that already understands but doesn’t care. But I don’t think this line of argumentation of yours is correct about why such a scenario contradicts his points. Even if he didn’t consider it, it’s not “contra” unless it actually contradicts him. What actually may contradict him is not “AI will understand values early” but “AI will understand values early and training such an early AI will make it care about the right things”.
This is MUCH more clearly written, thanks.
We still have the problems that we
can’t extract the exact concept (e.g., the concept of human values) from an AI, even if it has this concept somewhere. Yes, we can look at which activations correlate with some behaviour, and stuff like that. But it’s far from enough (a toy probe sketch follows this list).
can’t train an AI to optimize some concept from the world model of its earlier version. We have no ability to formalize a training objective like this.
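To illustrate the first problem in the list above, here is a toy linear probe on synthetic activations (all of the data is fabricated for the example, and there is no claim that a real model’s concept of human values is linearly decodable like this): the probe recovers a direction that correlates with a labelled behaviour, which is not the same as extracting the concept in a form you could hand to another AI as a training objective.

```python
import numpy as np

# Toy sketch only: the "activations" and labels are synthetic, generated so
# that one hypothetical direction correlates with a binary behaviour label.
rng = np.random.default_rng(0)
n, d = 1000, 128
concept_direction = rng.normal(size=d)
activations = rng.normal(size=(n, d))
labels = (activations @ concept_direction + rng.normal(scale=2.0, size=n)) > 0

# Linear probe: least-squares fit of the label from the activations.
w, *_ = np.linalg.lstsq(activations, labels.astype(float), rcond=None)
predictions = (activations @ w) > 0.5
print("probe accuracy:", (predictions == labels).mean())

# The probe yields a direction that correlates with the behaviour. That is
# "looking at which activations correlate" -- it does not hand us the concept
# as an objective we could train another AI against.
```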
Maybe Bostrom thought that weak AIs will not have a good enough world model, like you interpret him. Or maybe he already thought that we will not be able to use the world model of one AI to direct another. But the conclusion stands anyway.
I also think that current AIs probably don’t have the concept of human values that would actually be fine to optimize hard. And I’m not sure that AIs will have it before they have the ability to stop us from changing their goals. But if that were the only problem, I would agree that the risk is more manageable.