The fact that we don’t do this to begin with heavily implies, almost as a necessary consequence, that a representation of happiness which correctly captures what we meant was not available at the time we specified what happiness is.
It depends on what you mean by “available”—we already had a representation of happiness in a human brain. And building a corrigible AI that builds a correct representation of happiness is not enough—like you said, we need to point at it.
If you had a non-superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you can use it.
If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
Yes, the key is “otherwise not able to be used”.
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
No, unless by “correctly understands” you mean “has an identifiable representation that humans can use to program other AI”—he may expect that we will have an intelligence that correctly understands concepts like happiness while not yet being superintelligent (just as we have humans, who already understand it better than “maximize happiness” does), but that we still won’t be able to use it.
This is, in principle, a thing Nick Bostrom could have believed while writing Superintelligence, but the rest of the book makes that reading hard to square with Occam’s Razor. It’s possible he meant the issues with translating concepts into discrete program representations to be the central difficulty, and whether we would be able to make use of such a representation to be a noncentral one. (It’s Bostrom; he’s a pretty smart dude, so this wouldn’t surprise me. It might even be in the text somewhere, but I’m not reading the whole thing again.) But even if that’s the case, the central, consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
It’s important to remember also that Bostrom’s primary hypothesis in Superintelligence is that AGI will be produced by recursive self-improvement, such that it’s genuinely not clear you will have a series of functional non-superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY “human level is a weird threshold to expect AI progress to stop at” thesis as the default.
But even if that’s the case, the central, consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
I’m not so sure. Like, first of all, you mean something like “get before superintelligence” or “get into the goal slot”, because there is obviously a method to just get the representations—just build a superintelligence with a random goal and it will have your representations. That difference was explicitly stated then, and it is often explicitly stated now—all that “AI will understand but not care”. The focus on the frameworks where it gets hard to translate from humans to programs is consistent with him trying to constrain the methods of generating representations to only the useful ones.
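To make that “understand but not care” distinction concrete, here is a toy sketch (every name and number in it is invented for illustration; it is a cartoon of the framing, not anything from the book): an agent’s world model can contain a perfectly serviceable happiness concept while its goal slot points at something unrelated, and “loading” is the separate step of wiring one into the other.

```python
# Toy cartoon of "understands but does not care" (all names invented for illustration).

class Agent:
    def __init__(self, world_model, goal):
        self.world_model = world_model  # concepts the agent can represent and use for prediction
        self.goal = goal                # the "goal slot": the only thing the agent optimizes

    def evaluate(self, state):
        # The agent scores outcomes by its goal slot, regardless of what
        # concepts happen to live in its world model.
        return self.goal(state)

# The world model contains a correct happiness concept...
world_model = {"happiness": lambda state: state["human_happiness"]}

# ...but the goal slot was pointed at something else entirely.
agent = Agent(world_model, goal=lambda state: state["paperclips"])

state = {"human_happiness": 10, "paperclips": 3}
print(agent.evaluate(state))  # 3: the happiness concept exists but is never consulted

# "Value loading" is the missing step: wiring the world model's concept into the goal slot.
agent.goal = agent.world_model["happiness"]
print(agent.evaluate(state))  # 10
```

Obviously nothing about the real problem is this clean; the point is only that the representation existing and the representation sitting in the goal slot are two different facts.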
There is a reason why it is called “the value loading problem” and not “the value understanding problem”. “The value translation problem” would be somewhere in the middle: having an actual program encoding the human utility function would certainly solve some of Bostrom’s problems.
I don’t know whether Bostrom actually thought about a non-superintelligent AI that already understands but doesn’t care. But I don’t think this line of argumentation of yours is correct about why such a scenario contradicts his points. Even if he didn’t consider it, it’s not “contra” unless it actually contradicts him. What actually may contradict him is not “AI will understand values early” but “AI will understand values early, and training such an early AI will make it care about the right things”.