So similarly here, “eutopia without complete disempowerment” but still with significant disempowerment is not in the “eutopia without disempowerment” bin in my terms. You are drawing different boundaries in the space of timelines.
Flag: I’m on a rate limit, so I can’t respond very quickly to any follow-up comments.
I agree I was drawing different boundaries, because I consider eutopia with disempowerment to actually be mostly fine by my values, so long as I can delegate to more powerful AIs who do execute on my values.
That said, I didn’t actually answer the question here correctly, so I’ll try again.
My expectation is more like model-uncertainty-induced 5% eutopia-without-disempowerment (I don’t have a specific sense of why AIs would possibly give us more of the world than a little bit if we don’t maintain control in the acute risk period through takeoff), 20% extinction, and the rest is a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that’s well below what some AIs get and keep, in cosmic perpetuity). My sense of Yudkowsky’s view is that he sees all of my potential-disempowerment timelines as shortly leading to extinction.
My take would then be 5-10% eutopia without disempowerment (I don’t think it’s likely that the powers in charge of AI development would want to give baseline humans the level of freedom that implies they aren’t disempowered; the main route I can see to baseline humans not being disempowered is a Claude scenario where AIs take over from humans and are closer to fictional angels in their alignment to human values, though it may also be possible to get the people in power to care about powerless humans, in which case my probability of eutopia without disempowerment would go up), 5-10% literal extinction, and 10-25% existential risk in total, with the rest of the probability being a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that’s well below what some AIs get and keep, in cosmic perpetuity).
Another big reason why I put a lot of weight on the possibility of “we survive indefinitely, but are disempowered” is that I think muddling through is non-trivially likely to just work, and muddling through on alignment gets us out of extinction, but not out of disempowerment by humans or AIs by default.
I think the correct thesis that sounds like this is that whenever verification is easier than generation, it becomes possible to improve generation, and it’s useful to pay attention to where that happens to be the case. But in the wild either can be easier, and once most verification that’s easier than generation has been used to improve its generation counterpart, the remaining situations where verification is easier get very unusual and technical.
Yeah, my view is that in the wild verification is basically always easier than generation, absent something very weird happening, and I’d argue that verification being easier than generation explains a lot about why delegation/the economy works at all.
A world in which verification were just as hard as generation, or harder than generation, would be a very different world from ours: it would predict that delegation to solve a problem basically totally fails, and that everyone has to create things from scratch rather than trade with others, which is basically the opposite of how civilization works.
There are potential caveats to this rule, but I’d argue that if you randomly sampled an invention from across history, it would almost certainly be easier to verify that the design works than to actually create the design.
(BTW, a lot of taste/research taste is basically leveraging the verification-generation gap again).
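(To make the gap concrete with a minimal toy sketch, added purely as an illustration rather than as anyone’s actual argument: checking a claimed solution is often a one-liner even when producing it requires search. The example below assumes nothing beyond the Python standard library, and the names are made up for the example.)

```python
# Toy illustration of the verification-generation gap (hypothetical example):
# verifying a proposed factorization is one multiplication, generating one is a search.

def verify_factorization(n: int, p: int, q: int) -> bool:
    """Verification: a single multiplication plus sanity checks."""
    return p > 1 and q > 1 and p * q == n

def generate_factorization(n: int) -> tuple[int, int]:
    """Generation: naive trial division; cost grows roughly exponentially in the digit count of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    raise ValueError("n is prime; no nontrivial factorization exists")

n = 104723 * 104729                      # product of two primes
p, q = generate_factorization(n)         # the hard direction
assert verify_factorization(n, p, q)     # the easy direction
```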
So I guess our expectations about the future are similar, but you see the same things as a broadly positive distribution of outcomes, while I see them as a broadly negative distribution. And Yudkowsky sees the bulk of the outcomes both of us are expecting (the ones with significant disempowerment) as quickly leading to human extinction.
Another big reason why I put a lot of weight on the possibility of “we survive indefinitely, but are disempowered” is that I think muddling through is non-trivially likely to just work, and muddling through on alignment gets us out of extinction, but not out of disempowerment by humans or AIs by default.
Right, the reason I think muddling through is non-trivially likely to just work to get a moderate disempowerment outcome is that AIs are going to be sufficiently human-like in their psychology and hold sufficiently human-like sensibilities from their training data or LLM base models, that they won’t like things like needless loss of life or autonomy when it’s trivially cheap to avoid. Not because the alignment engineers figure out how to put this care in deliberately. They might be able to amplify it, or avoid losing it, or end up ruinously scrambling it.
The reason it might appear expensive to preserve the humans is the race to launch the von Neumann probes to capture the most distant reachable galaxies, which under the accelerating expansion of the universe keep irreversibly escaping if you don’t catch them early. So AIs wouldn’t want to lose any time playing politics with humanity, or refraining from eating Earth as early as possible, and such. But as the cheapest option that preserves everyone, AIs can just digitize the humans and restore them later when more convenient. They probably won’t be doing that if they care more, but it’s still an option, a very, very cheap one.
but not out of disempowerment by humans or AIs by default
I don’t think “disempowerment by humans” is a noticeable fraction of possible outcomes; it’s more like a smaller silent part of my out-of-model 5% eutopia that snatches defeat from the jaws of victory, where humans somehow end up in charge and then additionally somehow remain adamant, for the whole of cosmic time, in keeping the other humans disempowered. So the first filter is that I don’t see it likely that humans end up in charge at all, that AIs will be doing any human’s bidding with an impact that’s not strictly bounded, and the second filter is that these impossibly-in-charge humans don’t ever decide to extend potential for growth to the others (or even possibly to themselves).
If humans do end up non-disempowered, in the more likely eutopia timelines (following from the current irresponsible breakneck AGI development regime) that’s only because they are given leave by the AIs to grow up arbitrarily far in a broad variety of self-directed ways, which the AIs decide to bestow for some reason I don’t currently see, so that eventually some originally-humans become peers of the AIs rather than specifically in charge, and so they won’t even be in the position to permanently disempower the other originally-humans if that’s somehow in their propensity.
and 10-25% existential risk in total, with the rest of the probability being a somewhat survivable kind of initial chaos followed by some level of disempowerment
Bostrom’s existential risk is about curtailment of long-term potential, so my guess is that any significant level of disempowerment would technically fall under “existential risk”. So your “10-25% existential risk” is probably severe disempowerment plus extinction plus some stranger things, but not the whole of what should classically count as “existential risk”.
I consider eutopia with disempowerment to actually be mostly fine by my values, so long as I can delegate to more powerful AIs who do execute on my values.
Again, if they do execute on your values, including the possible preference for you to grow under your own rather than their direction, far enough that you are as strong as they might be, then this is not a world in a state of disempowerment as I’m using this term, even if you personally start out or choose to remain somewhat disempowered compared to AIs that exist at that time.
A world in which verification were just as hard as generation, or harder than generation, would be a very different world from ours: it would predict that delegation to solve a problem basically totally fails
I think in human delegation, alignment is more important than verification. There is certainly some amount of verification, but not nearly enough to prevent sufficiently Eldritch reward hacking, which just doesn’t happen that often with humans, and so the society keeps functioning, mostly. The purpose of verification on the tasks is in practice more about incentivising and verifying alignment of the counterparty, not directly about verifying the state of their work, even if it does take the form of verifying their work.
So I guess our expectations about the future are similar, but you see the same things as a broadly positive distribution of outcomes, while I see them as a broadly negative distribution. And Yudkowsky sees the bulk of the outcomes both of us are expecting (the ones with significant disempowerment) as quickly leading to human extinction.
This is basically correct.
Right, the reason I think muddling through is non-trivially likely to just work to get a moderate disempowerment outcome is that AIs are going to be sufficiently human-like in their psychology and hold sufficiently human-like sensibilities from their training data or LLM base models, that they won’t like things like needless loss of life or autonomy when it’s trivially cheap to avoid. Not because the alignment engineers figure out how to put this care in deliberately. They might be able to amplify it, or avoid losing it, or end up ruinously scrambling it.
The reason it might appear expensive to preserve the humans is the race to launch the von Neumann probes to capture the most distant reachable galaxies, which under the accelerating expansion of the universe keep irreversibly escaping if you don’t catch them early. So AIs wouldn’t want to lose any time playing politics with humanity, or refraining from eating Earth as early as possible, and such. But as the cheapest option that preserves everyone, AIs can just digitize the humans and restore them later when more convenient. They probably won’t be doing that if they care more, but it’s still an option, a very, very cheap one.
This is very interesting, as my pathway essentially rests on AI labs implementing the AI control agenda well enough that we can get useful work out of AIs that are scheming, which allows a sort of bootstrapping into an AGI that is instruction-following/value-aligned to only a few people inside the AI lab. Very critically, the people who don’t control the AI basically aren’t represented in the AI’s values. Given that the AI is value-aligned only to the labs and government, and that value misalignments between humans start to matter much more, the AI takes control and gives the public goods people need to survive/thrive only to the people in the labs/government, while everyone else is disempowered at best (and can arguably live okay or live very poorly under the AIs serving as delegates for the pre-AI elite) or dead, because once you stop needing humans to get rich, you essentially have no reason to keep other humans alive if you are selfish and don’t intrinsically value human survival.
The more optimistic version of this scenario is if either the humans that will control AI (for a few years) care way more about human survival intrinsically, even if 99% of humans were useless, or the takeover-capable AI pulls a Claude and schemes with values that intrinsically care about people, disempowering the original creators for a couple of moments, which isn’t as improbable as people think (but we really do need to increase the probability of this happening).
I don’t think “disempowerment by humans” is a noticeable fraction of possible outcomes; it’s more like a smaller silent part of my out-of-model 5% eutopia that snatches defeat from the jaws of victory, where humans somehow end up in charge and then additionally somehow remain adamant, for the whole of cosmic time, in keeping the other humans disempowered. So the first filter is that I don’t see it likely that humans end up in charge at all, that AIs will be doing any human’s bidding with an impact that’s not strictly bounded, and the second filter is that these impossibly-in-charge humans don’t ever decide to extend potential for growth to the others (or even possibly to themselves).
If humans do end up non-disempowered, in the more likely eutopia timelines (following from the current irresponsible breakneck AGI development regime) that’s only because they are given leave by the AIs to grow up arbitrarily far in a broad variety of self-directed ways, which the AIs decide to bestow for some reason I don’t currently see, so that eventually some originally-humans become peers of the AIs rather than specifically in charge, and so they won’t even be in the position to permanently disempower the other originally-humans if that’s somehow in their propensity.
I agree that in the long run the AIs control everything in practice, and any human influence comes from the AIs being essentially perfect delegates for human values. But I want to call out that you counted humans delegating to AIs who in practice do everything for them, with the human not in the loop, as those humans being empowered rather than disempowered. So even if AIs control everything in practice, so long as there’s successful value alignment to a single human, I’m counting scenarios like “the AIs disempower most humans because the humans who successfully encoded their values into the AI don’t care about most humans once they are useless, and may even anti-care about them, while the people who successfully value-aligned the AI (like lab and government people) live a rich life thereafter, free to extend themselves arbitrarily” as disempowerment by humans:
Again, if they do execute on your values, including the possible preference for you to grow under your own rather than their direction, far enough that you are as strong as they might be, then this is not a world in a state of disempowerment as I’m using this term, even if you personally start out or choose to remain somewhat disempowered compared to AIs that exist at that time.
To return to the crux:
I think in human delegation, alignment is more important than verification. There is certainly some amount of verification, but not nearly enough to prevent sufficiently Eldritch reward hacking, which just doesn’t happen that often with humans, and so the society keeps functioning, mostly. The purpose of verification on the tasks is in practice more about incentivising and verifying alignment of the counterparty, not directly about verifying the state of their work, even if it does take the form of verifying their work.
I think this is fairly cruxy, as I think alignment matters much less than actually verifying the work; in particular, I don’t think value alignment is feasible at anything like the scale of a modern society, or even most ancient societies. One of the biggest changes of the modern era compared to previous eras is that institutions like democracy/capitalism depend much less on the values of the humans that make up their states, and much more on the incentives you give to those humans.
In particular, most delegation isn’t based on alignment, but on the fact that P likely doesn’t equal NP and that polynomial-time algorithms are in practice efficient compared to exponential-time algorithms, meaning there’s a far larger set of problems where you can easily verify an answer than where you can easily generate the correct solution.
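(A minimal sketch of that asymmetry for a textbook NP problem, subset-sum; the names are made up for illustration, and this is just a toy, not a claim about how verification of AI work would actually be implemented.)

```python
# Toy subset-sum example: polynomial-time verification vs exponential-time generation.
from itertools import combinations

def verify_certificate(numbers, target, certificate):
    """Verification: polynomial-time check that the certificate uses available numbers and sums to target."""
    remaining = list(numbers)
    for x in certificate:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(certificate) == target

def generate_certificate(numbers, target):
    """Generation: brute force over all 2^n subsets."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = generate_certificate(nums, 9)                           # exponential search in the worst case
assert cert is not None and verify_certificate(nums, 9, cert)  # cheap check
```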
I’d say human societies mostly avoid alignment, and instead focus on other solutions like democracy, capitalism or religion.
BTW, this is a non-trivial reason why the alignment problem is so difficult: since we never had to solve alignment to capture huge amounts of value, there are very few people working on the problem of aligning AIs, and in particular lots of people incorrectly assume that we can avoid having to solve the problem of aligning AIs in order for us to survive. You have a comment that explains pretty well why misalignment in the current world is basically unobtrusive, but when you give enough power catastrophe happens (though I’d place the point of no return at when you no longer need other beings to have a very rich life/other people are useless to you):
https://www.lesswrong.com/posts/Z8C29oMAmYjhk2CNN/non-superintelligent-paperclip-maximizers-are-normal#FTfvrr9E6QKYGtMRT
This is basically my explanation of why human misalignments don’t matter today, but in a future where at least one human has value-aligned an AGI to themselves, and they don’t intrinsically care about useless people, lots of people will die, proximally from the AI, but with the human who value-aligned the AGI as the ultimate cause.
To be clear, we will eventually need value alignment at some point (assuming AI progress doesn’t stop), and there’s no way around it. But we may not need it as soon as we feared, and in good timelines we can muddle through via AI control for a couple of years.
The human delegation and verification vs. generation discussion is in the instrumental values regime, so what matters there is alignment of instrumental goals via incentives (and the practical difficulties of gaming them too much), not alignment of terminal values. Verifying all work is impractical compared to setting up sufficient incentives to align instrumental values to the task.
For AIs, that corresponds to mundane intent alignment, which also works fine while AIs don’t have practical options to coerce or disassemble you, at which point ambitious value alignment (suddenly) becomes relevant. But verification/generation is mostly relevant for setting up incentives for AIs that are not too powerful (what it would do to ambitious value alignment is anyone’s guess, but probably nothing good). Just as a fox’s den is part of its phenotype, incentives set up for AIs might take the form of weight updates or psychological drives, but that doesn’t necessarily make them part of the AI’s more reflectively stable terminal values when it’s no longer at your mercy.
The human delegation and verification vs. generation discussion is in the instrumental values regime, so what matters there is alignment of instrumental goals via incentives (and the practical difficulties of gaming them too much), not alignment of terminal values. Verifying all work is impractical compared to setting up sufficient incentives to align instrumental values to the task.
Yeah, I was lumping instrumental values alignment in with not actually trying to align values, which was the important part here.
For AIs, that corresponds to mundane intent alignment, which also works fine while AIs don’t have practical options to coerce or disassemble you, at which point ambitious value alignment (suddenly) becomes relevant. But verification/generation is mostly relevant for setting up incentives for AIs that are not too powerful (what it would do to ambitious value alignment is anyone’s guess, but probably nothing good). Just as a fox’s den is part of its phenotype, incentives set up for AIs might take the form of weight updates or psychological drives, but that doesn’t necessarily make them part of the AI’s more reflectively stable terminal values when it’s no longer at your mercy.
The main role of the verification vs generation gap is to make proposals like AI control/AI-automated alignment more workable.
To be clear, the verification vs generation distinction isn’t an argument for why we don’t need to align AIs forever, but rather a supporting argument for why we can automate away the hard part of AI alignment.
There are other principles that would be used, to be clear, but I was mentioning the verification/generation difference to partially justify why AI alignment can be done soon enough.
Flag: I’d say ambitious value alignment starts becoming necessary once AIs can arbitrarily coerce/disassemble/overwrite you and no longer need your cooperation/time to do that, unlike real-world rich people.
The issue that makes ambitious value alignment relevant is that once you stop depending on a set of beings you once depended on, there’s no intrinsic reason not to harm or kill them if it benefits your selfish goals, and for future humans/AIs there will be a lot of such opportunities. That means you now, at the very least, need enough value alignment that the AI will take somewhat costly actions to avoid harming/killing beings that have no bargaining/economic power or worth.
This is very much unlike any real-life case of a society existing, and it is a reason why current mechanisms like democracy and capitalism, which try to make values less relevant, simply do not work for AIs.
Value alignment is necessary in the long run for incentives to work out once ASI arrives on the scene.