I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs causes many bad outcomes, and many aspects of it are self-reinforcing. I don’t expect AI to fix these problems; I expect it to exacerbate them.
Sure, but these things don’t result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don’t see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
I wasn’t saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it’s even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don’t try at all without making this clear) they’d get useful advice for their own empowerment.)
The idea that decision-makers who “think a governance structure will yield total human disempowerment” will “do something else” also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own “power” (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.
I’m claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn’t about coordination. It’s that I expect individual powerseeking to suffice for individuals not losing their power.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But I don’t see why this would be the case. All shareholders/board members/etc. want to retain their power and will thus vote accordingly, which naively preserves their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative-sum dynamics isn’t a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run.
This isn’t at all an airtight argument, to be clear: you can in principle have an equilibrium where, if everyone powerseeks (without coordination), everyone gets negligible resources due to negative externalities (that result in some other non-human entity getting power), even if technical misalignment is solved. I just don’t see a very plausible case for this, and I don’t think the paper makes this case.
Handing off decision making to AIs is fine—the question is who ultimately gets to spend the profits.
If your claim is “insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth’s oceans)” then all of these seem at least somewhat plausible to me, but these aren’t the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.
More minimally, the mitigations discussed in the paper mostly wouldn’t help with these threat models IMO.
(I’m skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don’t think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover—why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn’t yield total human disempowerment.)
I do agree that there are problems other than AI misalignment, including: the default distribution of power might be problematic; people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly); people might go crazy due to super-persuasion or other cultural forces; society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently deferring to AIs; and many people might die in conflict due to very rapid tech progress.
First, RE the role of “solving alignment” in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would, that does not imply that gradual disempowerment problems aren’t important (since we can’t assume alignment will be solved).
3) I’m not sure what you mean by “alignment is solved”; I’m taking it to mean “AI systems can be trivially intent aligned”. Such a system may still say things like “Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?” and people can respond with “yes”—this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we’re in the “alignment basin”—I’m not sure what that means, precisely, but I would operationalize it as something like “the AI system is playing a roughly optimal CIRL game”. It’s unclear how good performance that can yield in practice (e.g. it can’t actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I’m more interested in the case where alignment is not “perfectly” “solved”, so there are clear and obvious opportunities to trade off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it’s subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say “seems aligned, let’s ship” will win. This is also part of the problem...
My main response, at a high level: Consider a simple model:
We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I’m getting at? Can you explain what you think is wrong with thinking of it this way?
Responding to some particular points below:
Sure, but these things don’t result in non-human entities obtaining power right?
Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.
Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don’t see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I’m getting at? Can you explain what you think is wrong with thinking of it this way?
I think something like this is a reasonable model but I have a few things I’d change.
Whichever group has more power at the end of the week survives.
Why can’t both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or a one-time reduction?)
Probably I’d prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies “A” and “B” where “A” just wants AI systems descended from them to have power and “B” wants to maximize the expected resources under control of humans in B. We’ll suppose that the government and other actors do nothing, for simplicity. If they start in the same spot, does “B” end up with substantially less expected power? To make this more realistic (as might be important), we’ll say that “B” has a random lead/disadvantage uniformly distributed between (e.g.) −3 and 3 months so that winner-takes-all dynamics aren’t a crux.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
What about if the humans in group B ask their AI to make them (the humans) as powerful as possible in expectation?
Supposing you’re fine with these changes, then my claim would be:
If alignment is solved, then the AI representing B can powerseek in exactly the same way as the AI representing A does while still deferring to the humans on long-run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligible and B barely loses any power relative to its initial position. If it is winner takes all, B has almost a 50% chance of winning.
If alignment isn’t solved, the strategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible, such that spending a month of delay to work specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment or ending up in a (successful) basin where AIs are trying to actively retain human power / align themselves better. (A substantial fraction of this is via deferring to some AI system of dubious trustworthiness because you’re in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one-time haircut from my perspective.) So, if it is winner takes all, (naively) B wins in 2/6 * 1/2 = 1/6 of worlds, which is 3x worse than the original 50% baseline. (The 2/6 is because they delay for a month, so B only wins the race when its random lead exceeds 1 month.) But the issue I’m imagining here wasn’t gradual disempowerment! The issue was that B failed to align their AIs and people at A didn’t care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)
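The back-of-the-envelope above can be checked with a toy Monte Carlo. All parameters are the illustrative ones from this comment (a lead uniform on ±3 months, a 1-month safety delay, a 50% chance the alignment work succeeds); nothing here is meant as a serious model:

```python
import random

def simulate(trials=100_000, seed=0):
    """Toy Monte Carlo of the winner-takes-all race sketched above.

    B's lead over A is uniform on (-3, 3) months. B delays 1 month to
    work on alignment, which succeeds with probability 1/2. B "wins"
    only if it both stays ahead after the delay and solves alignment.
    All numbers are the comment's illustrative assumptions.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        lead = rng.uniform(-3, 3)        # B's initial lead in months
        keeps_lead = (lead - 1) > 0      # still ahead after the 1-month delay
        solves_alignment = rng.random() < 0.5
        if keeps_lead and solves_alignment:
            wins += 1
    return wins / trials

print(simulate())  # close to 1/6, matching 2/6 * 1/2
```

P(lead > 1 month) = 2/6 for a uniform(−3, 3) lead, so the simulation recovers the 1/6 figure.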
I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.
At a more basic level, when I think about what goes wrong in these worlds, it doesn’t seem very likely to be well described as gradual disempowerment (in the sense described in the paper). The existence of an alignment tax doesn’t imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a substantial, long-lasting alignment tax), but I don’t think the actual problem will be well described as gradual disempowerment in the sense described in the paper.
(I don’t think I should engage more on gradual disempowerment for the time being unless someone wants to bid for this or trade favors for this or similar. Sorry.)
a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful)
I’m not sure it’s very cheap.
It seems to me that for the same amount of energy and land you need for a human, you could replace a lot more economically valuable work with AI.
Sure, at some point keeping humans alive is a negligible cost, but there’s a transition period while it’s still relatively expensive—and that’s part of why a lot of people are going to be laid off—even if the company ends up getting super rich.
Right now, the cost of feeding all humans is around 1% of GDP. It’s even cheaper to keep people alive for another year, as the food is already there and converting this food into energy for AIs would be harder than getting energy in other ways.
If GDP has massively increased due to powerful AIs, the relative cost would go down further.
Sure, resources going to feeding humans could instead go to creating slightly more output (and this will be large at an absolute level), but I’d still call keeping humans alive cheap given the low fraction.
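The scaling claim above is simple arithmetic: if keeping everyone fed costs a roughly fixed absolute amount, its share of output shrinks in proportion to GDP growth. A minimal sketch, taking the comment's ~1% figure as the baseline and a hypothetical 100x post-AI economy:

```python
def food_share(gdp_multiplier, baseline_share=0.01):
    """Share of GDP spent keeping humans fed, assuming the absolute
    cost stays roughly fixed while GDP grows by `gdp_multiplier`.
    The 1% baseline is the comment's figure; the multiplier is a
    hypothetical input, not a prediction."""
    return baseline_share / gdp_multiplier

print(food_share(1))    # today's baseline share
print(food_share(100))  # a hypothetical 100x economy
```

Under these assumptions, a 100x economy spends only 0.01% of output on feeding everyone, which is the sense in which the cost becomes negligible.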
Thanks for continuing to engage. I really appreciate this thread.
“Feeding humans” is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP—modulo the fact that GDP is growing.
But more fundamentally, I’m not sure the correct way to discuss this resource allocation is at the civilization level rather than at the company level. Let’s say that we have:
Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let’s say)
Company B that is composed of 10 automated-humans ($10k/month)
It seems to me that if you are an investor, you will give your money to B. In the long term, B is much more competitive, earns more money, and is able to reduce its prices; nobody buys from A, B invests this money into more automated-humans, and A goes bankrupt. Even if alignment is solved and the humans listen to their AIs, it’s hard to be competitive.
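The cost gap can be made concrete with a toy monthly P&L. The costs are the hypothetical figures from the example above ($5k/month per human; $5k/month of inference for 5 automated workers, i.e. $1k each); the revenue-per-worker figure is an additional assumption purely for illustration:

```python
def monthly_profit(revenue_per_worker, n_humans, n_ais,
                   human_cost=5_000, ai_cost=1_000):
    """Toy monthly profit for the two companies above, using the
    example's hypothetical cost figures (in $/month)."""
    revenue = revenue_per_worker * (n_humans + n_ais)
    costs = n_humans * human_cost + n_ais * ai_cost
    return revenue - costs

# Assume, say, $3k of revenue per worker per month:
profit_a = monthly_profit(3_000, n_humans=1, n_ais=5)   # Company A
profit_b = monthly_profit(3_000, n_humans=0, n_ais=10)  # Company B
print(profit_a, profit_b)  # 8000 20000
```

With identical total costs ($10k/month each), B's all-AI workforce produces more output per dollar, which is the compounding advantage the comment points at.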
Sure, but none of these things are cruxes for the argument I was making which was that it wasn’t that expensive to keep humans physically alive.
I’m not denying that humans might all be out of work quickly (putting aside regulatory capture, government jobs, human job programs, etc.). My view is more that if alignment is solved, it isn’t hard for some humans to stay alive and retain control, and these humans could also pretty cheaply keep all other humans alive at a low competitiveness overhead.
I don’t think the typical person should find this reassuring, but the top-level post argues for stronger claims than “the situation might be very unpleasant because everyone will lose their job”.
OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.
And then we have this metastable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.
Is this situation desirable? Are you happy with such course of action?
Is this situation really stable?
For me, this is not really desirable: the power is probably going to be concentrated into 1–3 people; there is a huge potential for value lock-in; those CEOs become immortal; we potentially lose democracy (I don’t see companies or the US/China governments as particularly democratic right now); and the people at the top potentially become progressively corrupted, as is often the case. Hmm.
Then, is this situation really stable?
If alignment is solved and we have 1 human at the top—pretty much yes, even if revolutions, value drift of the ruler, or craziness are somewhat possible at some point.
If alignment is solved and we have multiple humans competing with their AIs—it depends a bit. It seems to me that we could conduct the same reasoning as above, not at the level of organizations but at the level of countries: Just as Company B might outcompete Company A by ditching human workers, couldn’t Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
And I’m not sure that cooperation and coordination in such a world would be so much improved. Even if the dictator listens to its aligned AI, we need a very strong notion of alignment to be able to affirm that all the AIs are going to advocate for “COOPERATE” in the prisoner’s dilemma and that all the dictators are going to listen. But at the same time, it’s not that costly to cooperate, as you said (even if I’m not sure that energy, land, and rare resources are really that cheap to continue to provide for humans).
But at least I think that I can see now how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy while this was not clear for me beforehand.
Another way to put this is that strategy stealing might not work, due to technical alignment difficulties or for other reasons, and I’m not sold that the other reasons I’ve heard so far are very lethal. I do think the situation might really suck, though, with e.g. tons of people dying of bioweapons and with some groups that aren’t sufficiently ruthless, or which don’t defer enough to AIs, getting disempowered.
BTW, this is also a crux for me as well, in that I do believe that absent technical misalignment, some humans will have most of the power by default, rather than AIs, because I believe AI rights will be limited by default.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don’t see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn’t a crux for this argument unless you can argue this will waste a substantial fraction of all human resources?
Sure, but these things don’t result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don’t see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
I wasn’t saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it’s even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don’t try at all without making this clear) they’d get useful advice for their own empowerment.)
I’m claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn’t about coordination. It’s that I expect individual powerseeking to suffice for individuals not losing their power.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don’t see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn’t a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run?
This isn’t at all an air tight argument to be clear, you can in principle have an equilibrium where if everyone powerseeks (without coordination) everyone gets negligable resources due to negative externalities (that result in some other non-human entity getting power) even if technical misalignment is solved. I just don’t see a very plausible case for this and I don’t think the paper makes this case.
Handing off decision making to AIs is fine—the question is who ultimately gets to spend the profits.
If your claim is “insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth’s oceans)” then all of these seem at least somewhat plausible to me, but these aren’t the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.
More minimally, the mitigations discussed in the paper mostly wouldn’t help with these threat models IMO.
(I’m skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don’t think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover—why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn’t yield total human disempowerment.)
I do agree that there are problems other than AI misalignment including that the default distribution of power might be problematic, people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly), people might go crazy due to super persuation or other cultural forces, society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently defering to AIs, and many people might die in conflict due to very rapid tech progress.
First, RE the role of “solving alignment” in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren’t important (since we can’t assume alignment will be solved).
3) I’m not sure what you mean by “alignment is solved”; I’m taking it to mean “AI systems can be trivially intent aligned”. Such a system may still say things like “Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?” and people can respond with “yes”—this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we’re in the “alignment basin”—I’m not sure what that means, precisely, but I would operationalize it as something like “the AI system is playing a roughly optimal CIRL game”. It’s unclear how good of performance that can yield in practice (e.g. it can’t actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I’m more interested in the case where alignment is not “perfectly” “solved”, and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it’s subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say “seems aligned, let’s ship” will win. This is also part of the problem...
My main response, at a high level:
Consider a simple model:
We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I’m getting at? Can you explain what you think it wrong with thinking of it this way?
Responding to some particular points below:
Yes, they do; they result in beaurocracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.
My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.
I think something like this is a reasonable model but I have a few things I’d change.
Why can’t both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A and is this going to be an ongoing trend or it is a one time reduction.)
Probably I’d prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies “A” and “B” where “A” just wants AI systems decended from them to have power and “B” wants to maximize the expected resources under control of humans in B. We’ll suppose that the government and other actors do nothing for simplicity. If they start in the same spot, does “B” end up with substantially less expected power? To make this more realistic (as might be important), we’ll say that “B” has a random lead/disadvantage uniformly distributed between (e.g.) −3 and 3 months so that winner takes all dynamics aren’t a crux.
What about if the humans in group B ask their AI to make them (the humans) as powerful in expectation?
Supposing you’re fine with these changes, then my claim would be:
If alignment is solved, then the AI representing B can powerseek in exactly the same way as the AI representing A does while still defering to the humans on the long run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligable and B barely loses any power relative to its initial position. If it is winner takes all, B has almost a 50% chance of winning.
If alignment isn’t solved, the stategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible such that by spending a month of delay to work specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment or ending up in a (successful) basin where AIs are trying to actively retain human power / align themselves better. (A substantial fraction of this is via defering to some AI system of dubious trustworthiness because you’re in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one time hair cut from my perspective.) So, if it is winner takes all, (naively) B wins in 2 / 6 * 1 / 2 = 1 / 6 of worlds which is 3x worse than the original 50% baseline. (2 / 6 is because they delay for a month.) But, the issue I’m imagining here wasn’t gradual disempowerment! The issue was that B failed to align their AIs and people at A didn’t care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)
I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.
At a more basic level, when I think about what goes wrong in these worlds, it doesn’t seem very likely to be well described as gradual disempowerment? (In the sense described in the paper.) The existance of an alignment tax doesn’t imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a long lasting alignment tax that is substantial), but I don’t think the actual problem will be well described as gradual disempowerment in the sense described in the paper.
(I don’t think I should engage more on gradual disempowerment for the time being unless somewhat wants to bid for this or trade favors for this or similar. Sorry.)
I’m not sure it’s very cheap.
It seems to me that for the same amount of energy and land you need for a human, you could replace a lot more economically valuable work with AI.
Sure, at some point keeping humans alive is a negligible cost, but there’s a transition period while it’s still relatively expensive—and that’s part of why a lot of people are going to be laid off—even if the company ends up getting super rich.
Right now, the cost of feeding all humans is around 1% of GDP. It’s even cheaper to keep people alive for a year as the food is already there and converting this food into energy for AIs will be harder than getting energy other ways.
If GDP has massively increased due to powerful AIs, the relative cost would go down further.
Sure, resources going to feeding humans could instead go to creating slightly more output (and this will be large at an absolute level), but I’d still call keeping humans alive cheap given the low fraction.
Thanks for continuing to engage. I really appreciate this thread.
“Feeding humans” is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP—modulo the fact that GDP is growing.
But more fundamentally, I’m not sure the correct way to discuss the resource allocation is to think at the civilization level rather than at the company level: Let’s say that we have:
Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let’s say)
Company B that is composed of 10 automated-humans ($10k/month)
It seems to me that if you are an investor, you will give your money to B. In the long term, B is much more competitive: it earns more money, can reduce its prices so that nobody buys from A, and reinvests the proceeds into more automated-humans until A is crushed and goes bankrupt. Even if alignment is solved and the humans listen to their AIs, it’s hard to be competitive.
Sure, but none of these things are cruxes for the argument I was making which was that it wasn’t that expensive to keep humans physically alive.
I’m not denying that humans might all be out of work quickly (putting aside regulatory capture, government jobs, human job programs, etc). My view is more that if alignment is solved it isn’t hard for some humans to stay alive and retain control, and these humans could also pretty cheaply keep all other humans alive at a low competitiveness overhead.
I don’t think the typical person should find this reassuring, but the top-level post argues for stronger claims than “the situation might be very unpleasant because everyone will lose their job”.
OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.
And then we have this meta-stable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.
Is this situation desirable? Are you happy with such a course of action?
Is this situation really stable?
For me, this is not really desirable: power is probably going to be concentrated in 1-3 people, there is a huge potential for value lock-in, those CEOs become immortal, we potentially lose democracy (I don’t see companies or the US/China governments as particularly democratic right now), and the people at the top potentially become progressively corrupted, as is often the case. Hmm.
Then, is this situation really stable?
If alignment is solved and we have 1 human at the top—pretty much yes, even if revolutions/value drift of the ruler/craziness are somewhat possible at some point maybe?
If alignment is solved and we have multiple humans competing with their AIs—it depends a bit. It seems to me that we could conduct the same reasoning as above—but not at the level of organizations, but the level of countries: Just as Company B might outcompete Company A by ditching human workers, couldn’t Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
And I’m not sure that cooperation and coordination in such a world would be so much improved. Even if the dictator listens to their aligned AI, we need a very strong notion of alignment to be confident that all the AIs will advocate for “COOPERATE” in the prisoner’s dilemma and that all the dictators will listen. But at the same time, it’s not that costly to cooperate, as you said (even if I’m not sure that energy, land, and rare resources are really that cheap to keep providing for humans).
But at least I think that I can see now how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy while this was not clear for me beforehand.
Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons, and I’m not sold that the other reasons I’ve heard so far are very lethal. I do think the situation might really suck though, with e.g. tons of people dying of bioweapons and with some groups that aren’t sufficiently ruthless, or which don’t defer enough to AIs, getting disempowered.
BTW, this is also a crux for me as well, in that I do believe that absent technical misalignment, some humans will have most of the power by default, rather than AIs, because I believe AI rights will be limited by default.