First, re: the role of “solving alignment” in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it did, that would not imply that gradual disempowerment problems aren’t important (since we can’t assume alignment will be solved).
3) I’m not sure what you mean by “alignment is solved”; I’m taking it to mean “AI systems can be trivially intent aligned”. Such a system may still say things like “Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?” and people can respond with “yes”—this is the sort of thing that probably still happens, IMO.
4) Alternatively, you might say we’re in the “alignment basin”—I’m not sure what that means, precisely, but I would operationalize it as something like “the AI system is playing a roughly optimal CIRL game”. It’s unclear how good a performance that can yield in practice (e.g. it can’t actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I’m more interested in the case where alignment is not “perfectly” “solved”, so that there are simply clear and obvious opportunities to trade off safety against performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it’s subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say “seems aligned, let’s ship” will win. This is also part of the problem...
My main response, at a high level: Consider a simple model:
We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I’m getting at? Can you explain what you think is wrong with thinking of it this way?
Responding to some particular points below:
Sure, but these things don’t result in non-human entities obtaining power right?
Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.
Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don’t see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?
My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I’m getting at? Can you explain what you think is wrong with thinking of it this way?
I think something like this is a reasonable model but I have a few things I’d change.
Whichever group has more power at the end of the week survives.
Why can’t both groups survive? Why is it winner-takes-all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or is it a one-time reduction?)
Probably I’d prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies, “A” and “B”, where “A” just wants AI systems descended from them to have power and “B” wants to maximize the expected resources under control of the humans in B. We’ll suppose, for simplicity, that the government and other actors do nothing. If they start in the same spot, does “B” end up with substantially less expected power? To make this more realistic (as might be important), we’ll say that “B” has a random lead/disadvantage uniformly distributed between (e.g.) −3 and 3 months, so that winner-takes-all dynamics aren’t a crux.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
What about if the humans in group B ask their AI to make them (the humans) as powerful in expectation?
Supposing you’re fine with these changes, then my claim would be:
If alignment is solved, then the AI representing B can power-seek in exactly the same way as the AI representing A does, while still deferring to the humans on long-run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligible and B barely loses any power relative to its initial position. If it is winner-takes-all, B has almost a 50% chance of winning.
If alignment isn’t solved, the strategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible, such that spending a month of delay working specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment, or of ending up in a (successful) basin where AIs are actively trying to retain human power / align themselves better. (A substantial fraction of this is via deferring to some AI system of dubious trustworthiness because you’re in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one-time haircut from my perspective.) So, if it is winner-takes-all, (naively) B wins in 2/6 * 1/2 = 1/6 of worlds, which is 3x worse than the original 50% baseline. (The 2/6 is because they delay for a month.) But the issue I’m imagining here wasn’t gradual disempowerment! The issue was that B failed to align their AIs and people at A didn’t care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)
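The arithmetic here can be checked directly: with B’s lead uniform on [−3, 3] months, a one-month safety delay means B wins the race only when its lead exceeds one month, which happens in 2/6 of worlds. A minimal Monte Carlo sketch (the 50% alignment-success probability and the one-month delay are the illustrative figures from the comment above, not established quantities):

```python
import random

def win_probability(delay_months=1.0, p_alignment=0.5, trials=1_000_000):
    """Estimate how often B wins a winner-takes-all race against A.

    B's initial lead over A is uniform on [-3, 3] months; B also burns
    `delay_months` on safety work, and its alignment effort succeeds
    with probability `p_alignment` (independent of the lead).
    """
    wins = 0
    for _ in range(trials):
        lead = random.uniform(-3.0, 3.0)  # B's head start (may be negative)
        if lead - delay_months > 0 and random.random() < p_alignment:
            wins += 1
    return wins / trials

# Closed form: P(lead > 1) * 1/2 = (2/6) * (1/2) = 1/6
print(win_probability())  # ~0.1667
```

With no delay and guaranteed alignment, `win_probability(0.0, 1.0)` recovers the 50% baseline, so the model reproduces both endpoints of the argument.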
I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.
At a more basic level, when I think about what goes wrong in these worlds, it doesn’t seem very likely to be well described as gradual disempowerment (in the sense described in the paper). The existence of an alignment tax doesn’t imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a substantial, long-lasting alignment tax), but I don’t think the actual problem will be well described as gradual disempowerment in the sense described in the paper.
(I don’t think I should engage more on gradual disempowerment for the time being unless someone wants to bid for this or trade favors for this or similar. Sorry.)
a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful)
I’m not sure it’s very cheap.
It seems to me that for the same amount of energy and land you need for a human, you could replace a lot more economically valuable work with AI.
Sure, at some point keeping humans alive is a negligible cost, but there’s a transition period while it’s still relatively expensive—and that’s part of why a lot of people are going to be laid off—even if the company ends up getting super rich.
Right now, the cost of feeding all humans is around 1% of GDP. It’s even cheaper to keep people alive for another year, as the food is already there, and converting this food into energy for AIs would be harder than getting energy other ways.
If GDP has massively increased due to powerful AIs, the relative cost would go down further.
Sure, resources going to feeding humans could instead go to creating slightly more output (and this will be large at an absolute level), but I’d still call keeping humans alive cheap given the low fraction.
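The “cheap as a fraction” point above reduces to a one-line calculation: if the absolute cost of feeding humans stays roughly fixed while GDP grows, the fraction shrinks in proportion. A toy sketch (the 1% baseline is from the comment; the growth multipliers are purely illustrative):

```python
def feeding_cost_fraction(baseline_fraction=0.01, gdp_multiplier=1.0):
    """Cost of feeding all humans as a share of GDP, assuming the
    absolute cost stays roughly constant while GDP scales up."""
    return baseline_fraction / gdp_multiplier

for growth in [1, 10, 100]:
    share = feeding_cost_fraction(gdp_multiplier=growth)
    print(f"GDP x{growth}: feeding humans costs {share:.4%} of GDP")
```

So a 100x GDP increase pushes the cost from ~1% down to ~0.01%, which is the sense in which keeping humans alive becomes negligible at the limit (the transition-period point above still stands).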
Thanks for continuing to engage. I really appreciate this thread.
“Feeding humans” is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP—modulo the fact that GDP is growing.
But more fundamentally, I’m not sure the correct way to discuss the resource allocation is to think at the civilization level rather than at the company level. Let’s say that we have:
Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let’s say)
Company B that is composed of 10 automated-humans ($10k/month)
It seems to me that if you are an investor, you will give your money to B. It seems that in the long term, B is much more competitive, earns more money, and is able to reduce its prices; nobody buys from A, B invests this money into more automated-humans and crushes A, and A goes bankrupt. Even if alignment is solved and the humans listen to their AIs, it’s hard to be competitive.
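The competitive pressure in this toy example can be made explicit. Assuming a human and an automated-human are equally productive, and reading the figures above as $5k/month for the human and $5k/month total for A’s five automated workers (both figures are the hypothetical numbers from the comment):

```python
def output_per_dollar(workers, monthly_cost):
    """Units of labor bought per dollar per month, assuming equal
    productivity for human and automated workers."""
    return workers / monthly_cost

company_a = output_per_dollar(workers=1 + 5, monthly_cost=10_000)  # 1 human + 5 AI
company_b = output_per_dollar(workers=10, monthly_cost=10_000)     # 10 AI

print(f"B's productivity edge over A: {company_b / company_a - 1:.0%}")  # 67%
```

Under these assumptions B gets roughly two-thirds more output per dollar, which is the mechanism by which investors and customers drift to B; the reply below accepts this while arguing it isn’t a crux for the cost-of-keeping-humans-alive claim.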
Sure, but none of these things are cruxes for the argument I was making, which was that it isn’t that expensive to keep humans physically alive.
I’m not denying that humans might all be out of work quickly (putting aside regulatory capture, government jobs, human job programs, etc.). My view is more that if alignment is solved, it isn’t hard for some humans to stay alive and retain control, and these humans could also pretty cheaply keep all other humans alive at a low competitiveness overhead.
I don’t think the typical person should find this reassuring, but the top-level post argues for stronger claims than “the situation might be very unpleasant because everyone will lose their job”.
OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.
And then we have this metastable equilibrium that might be sufficiently stable, where the humans at the top are feeding the other humans with some kind of UBI.
Is this situation desirable? Are you happy with such a course of action?
Is this situation really stable?
For me, this is not really desirable: the power is probably going to be concentrated in 1–3 people, there is huge potential for value lock-in, those CEOs become immortal, we potentially lose democracy (I don’t see companies or the US/China governments as particularly democratic right now), and the people at the top potentially become progressively corrupted, as is often the case. Hmm.
Then, is this situation really stable?
If alignment is solved and we have 1 human at the top—pretty much yes, even if revolutions, value drift of the ruler, or craziness remain somewhat possible at some point.
If alignment is solved and we have multiple humans competing with their AIs—it depends a bit. It seems to me that we could run the same reasoning as above, not at the level of organizations but at the level of countries: Just as Company B might outcompete Company A by ditching human workers, couldn’t Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
And I’m not sure that cooperation and coordination in such a world would be so much improved: even if each dictator listens to their aligned AI, we need a very strong notion of alignment to be confident that all the AIs will advocate “COOPERATE” in the prisoner’s dilemma and that all the dictators will listen. At the same time, it’s not that costly to cooperate, as you said (even if I’m not sure that energy, land, and rare resources are really that cheap to keep providing for humans).
But at least I think I can now see how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy, whereas this was not clear to me beforehand.
Another way to put this is that strategy stealing might not work, due to technical alignment difficulties or for other reasons, and I’m not sold that the other reasons I’ve heard so far are very lethal. I do think the situation might really suck, though, with e.g. tons of people dying of bioweapons, and with some groups that aren’t sufficiently ruthless, or which don’t defer enough to AIs, getting disempowered.