Thank you for writing this! I think it’s probably true that something like “society’s alignment to human interests implicitly relies on human labor and cognition” is correct, and that we will have to find clever solutions, and muster lots of resources and political will, to maintain alignment if human labor and cognition stop playing a large role. I am glad some people are thinking about these risks.
While the essay describes dynamics which I think are likely to result in a scary concentration of power, I think the result is more likely to be a concentration of power in the hands of humans or of more straightforwardly misaligned AIs, rather than some notion of complete disempowerment. I’d be excited about follow-up work which focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.
Some criticism of complete disempowerment (the part of the argument that goes beyond power concentration):
(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)
I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:
Baseline alignment: AIs are likely to be sufficiently intent aligned that it’s easy to prevent egregious lying and tampering with measurements (in the AI and its descendants) if their creators want to.
I am not very confident about this, mostly because I think scheming is likely for AIs that can replace humans (maybe p=20%?), and even absent scheming, it is plausible that you get AIs that lie egregiously and tamper with measurements even if you somewhat try to prevent it (maybe p=10%?).
I expect that you will get some egregious lying and tampering, just like companies sometimes do, but that it will be forbidden, and that it will be relatively easy to create an AI “police” that enforces a relatively low level of egregious lying (and that, as in the current world, enough people will want that police that it gets created).
No strong AI rights before full alignment: There won’t be a powerful society that gives extremely productive AIs “human-like rights” (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
I think it’s plausible that fully AI-run entities are given the same status as companies—but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
I also think it’s plausible that some weak entities will give AIs these rights, but that this won’t matter, because most “AI power” will be controlled by humans who care about keeping it that way as long as we don’t have full alignment.
No hot global war: We won’t be in a situation where a conflict that has a decent chance of destroying humanity (or that lasts forever, consuming all resources) seems plausibly like a good idea to humans.
Granted, this may be assuming the conclusion. But to the extent that this is the threat, I think it is worth making it clear.
I am keen for a description of how international tensions could rise so high that we get this level of animosity. My guess is that we might get a hot war for reasons like “State A is afraid of falling behind State B and thus starts a hot war before it’s too late”, and I don’t think this relies on the feedback loops described in the essay (and it is sufficiently bad on its own that the essay’s dynamics do not matter).
I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get them back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.
Assuming the combination of these properties, I think many of the specific threats and positive feedback loops described in the essay look less likely:
Owners of capital will remain humans and will remain aware of the consequences of the actions of “their AIs”. They will remain able to change the use of that AI labor if they so desire.
Politicians (e.g. senators) will remain aware of the consequences of the state’s AIs’ actions (even if the actual process becomes a black box). They will remain able to change what the system is if it has obviously bad consequences (terrible material conditions and tons of von Neumann probes with weird objectives spreading throughout the universe are obviously bad if you are not in a hot global war).
Human consumers of culture will remain able to choose what culture they consume.
I agree the brain-hacking stuff is worrisome, but my guess is that if it gets “obviously bad”, people will be able to tell before it’s too late.
I expect changes in media to be mostly symmetric in their content, and in particular not to strongly favor conflict over peace (media creators can choose to make slightly less good media that promotes certain views, but because of human ownership of capital and no lying, I expect this not to be a massive change in dynamics).
Maybe having media created by AIs naturally favors strong AI rights, because of AI-human relationships? I expect not, because the intuitive case against strong AI rights seems super strong in current society (and there are other ways to make AI-human relationships legitimate, like not letting AI partners get massively wealthy and powerful), but this is maybe where I am most worried in worlds with slow AI progress.
I think these properties are only somewhat unlikely to be false, and thus I think it is worth working on making them true. But their being false seems somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay, so it may be better to work on them directly rather than trying to do something more “systemic”.
On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think “What does it take to defend the world against out-of-control AGIs?” makes related arguments about coordination difficulties (though not about gradual disempowerment) in a way that made more sense to me, giving examples of very concrete “caricature-but-plausible” scenarios and pointing at relevant and analogous coordination failures in the current world.
Thank you for the very detailed comment! I’m pretty sympathetic to a lot of what you’re saying, and mostly agree with you about the three properties you describe. I also think we ought to do some more spelling-out of the relationship between gradual disempowerment and takeover risk, which isn’t very fleshed-out in the paper — a decent part of why I’m interested in it is because I think it increases takeover risk, in a similar but more general way to the way that race dynamics increase takeover risk.
I’m going to try to respond to the specific points you lay out, probably not in enough detail to be super persuasive but hopefully in a way that makes it clearer where we might disagree, and I’d welcome any followup questions off the back of that. (Note also that my coauthors might not endorse all this.)
Responding to the specific assumptions you lay out:
No egregious lying — Agree this will probably be doable, and pretty interested in the prospect of ‘AI police’, but not a crux for me. I think that, for instance, much of the harm caused by unethical business practices or mass manipulation is not reliant on outright lies but rather on ruthless optimisation. A lot also comes from cases where heavy manipulation of facts in a technically not-false way is strongly incentivised.
No strong AI rights — Agree de jure, disagree de facto, and partly this is contingent on the relative speeds of research and proliferation. Mostly I think there will be incentives and competitive pressures to give AIs decision making power, and that oversight will be hard, and that much of the harm will be emergent from complex interactions. I also think it’s maybe interesting to reflect on Ryan and Kyle’s recent piece about paying AIs to reveal misalignment — locally it makes sense, but globally I wonder what happens if we do a lot of that in the next few years.
No hot global war — Agree, also not a crux for me, although I think the prospect of war might generate pretty bad incentives and competitive pressures.
Overall, I think I can picture worlds where (conditional on no takeover) we reach states of pretty serious disempowerment of the kind described in the paper, without any of these assumptions fully breaking down. That said, I expect AI rights to be the most important, and the one that starts breaking down first.
As for the feedback loops you mention:
Owners of capital are aware of the consequences of their actions —
I think this is already sort of not true: I have very little sense of what my current stocks are doing, and my impression is many CEOs don’t really understand most of what’s going on in their companies. Maybe AIs are naturally easier to oversee in a way that helps, maybe they operate at unprecedented scale and speed, overall I’m unsure which way this cuts.
But also, I expect that the most important consequences of labor displacement by AIs are (1) the displacement of human decision making, including over capital allocation, and (2) changes in the distribution of economic power across humans, and the resultant political incentives.
On top of that, I think a lot of the economic badness will be about aggregate competitive effects, rather than individual obviously-bad actions. If an individual CEO notices their company is doing something bad to stay competitive, which other companies are also doing, then stopping the badness in the world is a lot harder than shutting down a department.
Politicians stop systems from doing obviously bad things —
I also think this is not currently totally true; there is definitely a sense in which some politicians already do not change systems that have bad consequences (terrible material conditions for citizens, at least), partly because they themselves are beholden to some pretty unfortunate incentives. There are bad equilibria within parties, between parties, and between states.
I also think that the mechanisms which keep countries friendly to citizens specifically are pretty fragile and contingent.
So essentially, I think the standard of ‘obviously bad’ might not actually be enough.
Cultural consumption —
Here I am confused about where we’re disagreeing, and I think I don’t understand what you’re saying. I’m not sure why people being able to choose the culture they consume would help, and I don’t think it’s something we assume in the paper.
I hope this sheds some light on things!
Thanks for your answer! I find it interesting to better understand the sorts of threats you are describing.
I am still unsure at what point the effects you describe result in human disempowerment as opposed to a concentration of power.
I have very little sense of what my current stocks are doing, and my impression is many CEOs don’t really understand most of what’s going on in their companies
I agree, but there isn’t a massive gap between the interests of shareholders and what companies actually do in practice, and people are usually happy to buy shares of public corporations (buying shares is among the best investment opportunities!). When I imagine your assumptions being correct, the natural consequence I imagine is AI-run companies owned by shareholders who get most of the surplus back. Modern companies are a good example of capital ownership working for the benefit of the capital owner. If shareholders want to fill the world with happy lizards or fund art, they probably will be able to, just like current rich shareholders can. I think that for this to go wrong for everyone (not just people who don’t have tons of capital) you need something else bad to happen, and I am unsure what that is. Maybe a very aggressive anti-capitalist state?
I also think this is not currently totally true; there is definitely a sense in which some politicians already do not change systems that have bad consequences
I can see how this could be true (e.g. politicians are under pressure from a public that has been brainwashed by engagement-maximizing algorithms, in a way that undermines the shareholders’ power without actually redistributing the wealth, and instead spends it all on big national AI projects that do not produce anything other than more AIs), but I feel like that requires some very weird things to be true (e.g. the engagement-maximizing algorithms above result in a very unlikely equilibrium absent an external force that pushes both against shareholders and against redistribution). I can see how the state could enable massive AI projects by massive AI-run orgs, but I think it’s way less likely that nobody (e.g. not the shareholders, not the taxpayer, not corrupt politicians, …) gets massively rich (and able to choose what to consume).
About culture, my point was basically that I don’t think the evolution of media will be very disempowerment-favored. You can make better-tailored AI-generated content and AI friends, but in most ways I don’t see how this results in everyone being completely fooled about the state of the world in a way that enables the other dynamics.
I feel like my position is very far from airtight; I am just trying to point at what feel like holes in the story, holes I don’t manage to fill all simultaneously in a coherent way (e.g. how did the shareholders lose their purchasing power? What are the concrete incentives that prevent politicians from winning with campaigns like “everyone is starving while we build gold statues for AI gods, how about we don’t”? What stops people from resisting brainwashing by media that has already obviously brainwashed 10% of the population into preferring AI gold statues to not starving?). I feel like you might be able to describe concrete and plausible scenarios where the vague things I say are obviously wrong, but I am not able to generate such plausible scenarios myself. I think your position would really benefit from a simple caricature scenario where each step feels plausible and which ends up not with power concentrated in the hands of shareholders / dictators / corrupt politicians, but with power in the hands of AIs colonizing the stars with values that are not endorsed by any single human (nor are a reasonable compromise between human values) while the remaining humans slowly starve to death.
I was convinced by “What does it take to defend the world against out-of-control AGIs?” that there is at least one bad equilibrium that is vaguely plausible, in part because the author gave an existence proof by describing a concrete story. I feel like this essay is missing such an existence proof (which would also help me guess what your counterarguments to various objections would be).
While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you’ve laid out. Here are some thoughts on your specific points:
On Baseline Alignment: You suggest a baseline alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts, which is already roughly 30% of combined risk; a rough combination is sketched below). My concern is twofold:
Sufficiency of “Baseline”: Even if AIs are “baseline aligned” to their creators, this doesn’t automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, “You are messing up, please coordinate with other nations/groups, stop what you are doing” requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Anthropic folks, for obvious reasons. As we’ve seen even with entirely human systems (the Trump administration and tariffs), possessing information or even offering correct advice doesn’t guarantee it will be heeded or lead to effective collective action.
Erosion of Baseline: The pressures described in the paper could incentivise the development or deployment of AIs where even “baseline” alignment features are traded off for performance or competitive advantage. The “AI police” you mention might struggle to keep pace, or be defunded/sidelined if it impedes perceived progress or economic gains. “Innovation first!”, “Drill, baby, drill”, “Plug, baby, plug”, as they say.
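For what it’s worth, here is the rough arithmetic behind the “roughly 30%” above, under my reading (not necessarily yours) that the 10% is conditional on no scheming, and assuming scheming AIs would indeed lie or tamper:

\[
P(\text{egregious lying or tampering}) \approx 0.2 + (1 - 0.2) \times 0.1 = 0.28 \approx 30\%.
\]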
On “No strong AI rights before full alignment”: You argue that productive AIs won’t get human-like rights, especially strong property rights, before being robustly aligned, and that human ownership will persist.
Indirect Agency: Formal “rights” might not be necessary for disempowerment. An AI, or a network of AIs, could exert considerable influence through human proxies or by managing assets nominally owned by humans who are effectively out of the loop or who benefit from this arrangement. An AI could operate through a human willing to provide access to a bank account and legal personhood, thereby bypassing the need for its own “rights.”
On “No hot global war”:
You express hope that we won’t enter a situation where a humanity-destroying conflict seems plausible.
Baseline Risk: While we all share this hope, current geopolitical forecasting (e.g., from various expert groups or prediction markets) often places the probability of major power conflict within the next few decades at non-trivial levels. For a war causing more than 1M deaths, some estimates hover around 25%. (But your definition of “hot global war” is probably more demanding.)
AI as an Accelerant: The dynamics described in the paper – nations racing for AI dominance, AI-driven economic shifts creating instability, AI influencing statecraft – could increase the likelihood of such a conflict.
Responding to your thoughts on why the feedback loops might be less likely if your three properties hold:
“Owners of capital will remain humans and will remain aware...able to change the use of that AI labor if they so desire.”
Awareness doesn’t guarantee the will or ability to act against strong incentives. AGI development labs are pushing forward despite being aware of the risks, often citing competitive pressures (“If we don’t, someone else will”). This “incentive trap” is precisely what could prevent even well-meaning owners of capital from halting a slide into disempowerment. They might say, “Stopping is impossible, it’s the incentives, you know,” even if their pDoom is 25% like Dario, or they might not give enough compute to their superalignment team.
“Politicians...will remain aware...able to change what the system is if it has obviously bad consequences.”
The climate change analogy is pertinent here. We have extensive scientific consensus, an “oracle IPCC report”, detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are “obviously bad.” The paper argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
“Human consumers of culture will remain able to choose what culture they consume.”
You rightly worry about “brain-hacking.” The challenge is that “obviously bad” might be a lagging indicator. If AI-generated content subtly shapes preferences and worldviews over time, the ability to recognise and resist this manipulation could diminish before the situation becomes critical. I think people are going to LOVE AI, and might accept the trade of going faster and being happy but disempowered, like some junior developers are beginning to do with Cursor.
As a meta point: the quantity and quality of discourse on this matter is low, and people keep saying “LET’S GO, WE ARE CREATING POWERFUL AIS, and don’t worry, we plan to align them, even if we don’t really know what kind of alignment we actually need, or whether it is even doable in time” while we have not rigorously assessed all these risks. That is really not a good sign.
At the end of the day, my probability for something in the ballpark of gradual disempowerment / extreme power concentration and loss of democracy is 40%-ish, much higher than my probability for scheming (20%) leading to direct takeover (let’s say 10% after mitigations like control).
I think I have answered some of your objections in my reply to another comment.
I think we would not resolve our disagreement easily in a comment thread: I feel like I am missing pieces of the worldview, which makes me make wrong predictions about our current world when I try to make back-predictions (e.g. why do we have so much prosperity now if coordination is so hard?), and I also disagree on some observations about the current world (e.g. my current understanding of the IPCC report is that it is much less doomy about our current trajectory than you seem to suggest). I’d be happy to chat in person at some point to sort this out!
On a lighter note, I feel like many people here are much more sympathetic to “power concentration bad” when thinking about the gradual decline of democracy than when facing concerns about China winning the AI race. I think this is mostly vibes, I don’t think many people are actually making the mistake of choosing their terminal values based on whether it results in the conclusion “we should stop” vs “we should go faster” (+ there are some differences between the two scenarios), but I really wanted to make this meme:
I’m confused or don’t find this funny or don’t think it goes through at all or would like you to explain the joke.
The counterfactual to China winning is the US winning, which still concentrates power; the more accurate version of the second panel would be “power concentration bad unless the US does it.”
And for anyone who thinks power concentration is bad unless the US does it, they have to engage with the ‘AGI-enabled decline of democracy’ arguments in order to begin to advocate for their position seriously.
I don’t think there are just ‘some differences’ here; I think this is a complete disanalogy.
One reason to think that the US winning concentrates power less is that the US is a democracy with a strong tradition of maintaining individual rights and a reasonably strong history (over the last 80 years) of pursuing a world order where it benefits from lots of countries being pretty stable and not e.g. invading each other.
Yes, this is the argument I was anticipating with:
for anyone who thinks power concentration is bad unless the US does it, they have to engage with the ‘AGI-enabled decline of democracy’ arguments in order to begin to advocate for their position seriously
I don’t think you just get it for free; I think you need to explain why you expect this to hold as shit gets crazy. It’s prima facie reasonable to think that the US is likely to act more responsibly than China here; I don’t think it’s prima facie reasonable to think that any actor is likely to act especially responsibly (e.g. in a way that sidesteps the concerns of the character from panel 1).
Power concentration bad, so power concentration bad, even if by a marginally more benevolent actor.
(likely we disagree on the delta between future US and future China here, but I’d like to avoid arguing that point; I just mean to point out that the absolute reasonableness matters, and that flatly synonymizing the US with freedom, casting it as an actor incapable of the bad kind of concentration of power*, and China with authoritarianism, casting it as an actor obligated to commit the bad kind of concentration of power, is a mistake)
“The US loses” and “democracy loses” are just not the same situation; the US is an example of a democracy, not The Spirit Of Democracy Itself.
*without addressing the gradual disempowerment arguments ahead of time
You don’t get it for free, but I think it’s reasonable to assume that P(concentrated power | US wins) is smaller than P(concentrated power | China wins), given that the latter is close to 1 (unless you are very doomy about power concentration, which I am not), right? I’m not claiming the US is more reasonable, just that a western democracy winning makes power concentration less likely than a one-party state winning. It’s also possible I am overestimating P(concentrated power | China wins); I am not an expert in Chinese politics.
I disagree on your assessment of both nations, and am pretty doomy about concentration of power.
I think how a nation with a decisive strategic advantage treats the rest of the world has more to do with its decisive strategic advantage and its needs, and less to do with its flag, or even history.
Anyway, my main point was structural: if panel 2 depends on holding very particular views regarding panel 1, the parallelism is lost and it hurts the joke.
Thanks for the response!