IMO, the best argument for AI safety looks something like this:
Eventually, within this century, someone will deploy AIs that can, at a minimum, make humans basically worthless at ~all jobs.
Once you don’t depend on anyone else to survive, and once the society you are in is economically worthless or even net-negative from a selfish perspective, because its members can’t do anything relevant and cannot resist what you can do, there’s no reason not to steal from them or kill them anymore: their property/land/capital is still valuable, but their labor is worthless. The argument is strengthened to the extent that AIs can develop technology that lets expropriation recover more of the property’s value.
Thus, AIs at this power level need to terminally care about people/beings that have no power whatsoever: they need to terminally value the survival of beings with zero power or leverage, including humans.
This might or might not be difficult to achieve, but we don’t yet know how difficult it is to align AIs that could displace all humans at their jobs. That is worrisome given the empirical record of how powerful entities have treated those with much less power, and, importantly, because the period from the end of World War II to today, in which powerful people have treated less powerful people well, fundamentally rests on conditions that will break once AIs can take ~all the jobs.
We’ve never had to solve value alignment before, because everyone depends on everyone else for power, which means institutional design that is robust to value misalignment works, and because we can’t change people’s values anyway.
Thus, there’s a reasonable chance of existential catastrophe if we build AI that can replace us at all jobs before we have done serious alignment work.
I’ll flag that I think pure LLMs are less relevant to takeover concerns than I once thought, so I am less optimistic than I was in 2024. I’ll also say that the current level of awareness is unfortunately not very predictive of things like “if an AI model were clearly hacking its data center, would there be a strong response, like pausing or shutting down the model?” Buck gives some good reasons why strong responses may not happen: https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
So while I don’t think value alignment is sufficient, I do think something like value alignment will be necessary in futures where AI controls everything and yet we have survived for more than a decade.
I notice I’m confused about how you think these thoughts slot in with mine. What you’re saying feels basically congruent with what I’m saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I will note—to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment. It feels like a non sequitur to jump from them to point 3 about safety—I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be “how do we make society work given these capabilities?” I’m curious if you left out that consideration because you think it’s less big than safety, or because you were focusing on the story for safety in particular.
I notice I’m confused about how you think these thoughts slot in with mine. What you’re saying feels basically congruent with what I’m saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I don’t disagree on A or B, for the record, and while I’ve updated on AI alignment being harder than I used to think, I’m still relatively uncertain about how difficult AI alignment actually is.
I will note—to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment.
I actually agree with this, but I’ll flag that the amount of value alignment that will be necessary from AIs does mean that authoritarianism is likely to be way less bad for most human values (not all), because I view democracy and a lot of other governance structures as attempts to rely less on value alignment and more on incentives. For reasons I’ll get to later, though, I do think value alignment is just way, way more necessary for you to survive under AI governance than under human governance, which brings us to this:
It feels like a non sequitur to jump from them to point 3 about safety—I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be “how do we make society work given these capabilities?” I’m curious if you left out that consideration because you think it’s less big than safety, or because you were focusing on the story for safety in particular.
In a literal sense, society will continue to work, even if it is warped immensely by AIs. The reason I left out the consideration of “how can we maintain our survival without requiring the value alignment of the most powerful beings (by default AIs) once they take all human jobs?” is that I think it’s basically impossible to get an equilibrium where humans survive AI rule without assumptions about the AIs’ utility functions/values, unlike in traditional economic modelling.
The reasons for this are twofold:
1. Humans’ land/capital/property is still valuable, but their labor is worthless, so from a selfish perspective the reason to keep them alive and in good condition is gone, and you have no reason to invest in anything that helps them earn income to buy goods and fuel their consumption. Indeed, stealing their property or killing them is, from a selfish perspective, valuable, all other things being held equal (see the toy payoff sketch below).
Indeed, I like this quote from The Intelligence Curse (https://intelligence-curse.ai/defining/), which explains why you would satisfy rich human/machine demand instead of non-rich human demand:
A common rebuttal is that some jobs can never be automated because we will demand humans do them.
For example, teachers. Most parents would probably strongly prefer a real, human teacher to watch their kids throughout the day. But this argument totally misses the bigger picture: it’s not that there won’t be a demand for teachers, it’s that there won’t be an incentive to fund schools. This argument repeats ad nauseam for anything that invests in regular people’s productive capacity, any luxury that relies on their surplus income, or any good that keeps them afloat. By default, powerful actors won’t build things that employ humans or provide them resources, because they won’t have to.
2. Conflict isn’t costly for AIs against baseline humans once AIs take over, and thus there’s no way to actually threaten them into giving us a share of the pie.
If conflict between AIs and humans did happen once AI has taken over, it would at best be closer to the European conflicts with Africa and the Americas from 1500-1900, and at worst closer to humanity’s conflicts with wild animals, which have ended in annihilation for tens or hundreds of thousands of species, or more.
Or, as Jeremiah England put it in shorter terms (https://x.com/JeremiahEnglan5/status/1929371594553438245):
It seems like there are two main reasons for treating someone well who you don’t care about: (1) they perform better for you when you do, (2) they will raise hell if you don’t.
This is why I said value alignment of AIs is ultimately necessary, and why you need AIs that terminally value the thriving/survival of beings that have zero or negative economic usefulness to them: institutional solutions don’t work if they are trivial and beneficial to subvert, and economics favors selfish AIs killing all humans to get more land and capital.
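To make the selfish calculus behind points 1 and 2 concrete, here is a minimal toy payoff sketch in Python. The numbers and the names (labor_value, recoverable_fraction, conflict_cost) are purely illustrative assumptions, not anything from the thread; the only point is that trade dominates expropriation while labor is valuable and conflict is costly, and flips once both collapse:

```python
# Toy model of the selfish calculus described above: a powerful actor
# choosing between leaving humans alone to trade with them, or
# expropriating their property. All numbers are made up for illustration.

def payoff_trade(labor_value: float) -> float:
    """Leave humans alone and trade: the actor gains whatever their labor
    is worth to it (their property stays theirs)."""
    return labor_value

def payoff_expropriate(property_value: float,
                       recoverable_fraction: float,
                       conflict_cost: float) -> float:
    """Take the property by force: recover some fraction of its value,
    pay whatever the conflict costs."""
    return recoverable_fraction * property_value - conflict_cost

# Today-ish situation: human labor is valuable, conflict is expensive.
print(payoff_trade(labor_value=100))                     # 100
print(payoff_expropriate(property_value=50,
                         recoverable_fraction=0.5,
                         conflict_cost=80))              # -55.0

# The "AIs take ~all jobs" situation the argument describes: labor ~0,
# conflict nearly free, and better tech recovers more of the property.
print(payoff_trade(labor_value=0))                       # 0
print(payoff_expropriate(property_value=50,
                         recoverable_fraction=0.9,
                         conflict_cost=1))               # 44.0
```

Everything of substance in the argument is, of course, about whether those two quantities really do collapse, not about the arithmetic.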
Apologies for the scrappiness of the below—I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.
I like the intelligence curse piece very much—it’s what I meant to reference when I linked the Turing Trap above, but I couldn’t remember the title & Claude pointed me to that piece instead. I agree with everything you’re saying directionally! But I feel some difference in emphasis or vibe that I’m curious about.
-
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetoes on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
-
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned. (though you might need them to at minimum care about consciousness, so they don’t optimize it away)
A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.
I think this is partly down to founder effects where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
-
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” to the AGI future rather than “equilibrium dynamics” of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup but I don’t see it, the returns to scale seem so good).
So maybe I’d say “if you’re not talking a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetoes on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
You’ve correctly noted why lots of people may not be safe even in a physical sense, even assuming value alignment/corrigibility/intent alignment/instruction following is solved. I also think you’re correct that there’s no guarantee we wind up in a basin of convergence; I’d even argue it’s unlikely to converge and more likely to diverge, because there is no single moral reality (there are infinitely many correct moralities/moral realities), so yes, the oversight problem is pretty severe.
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned. (though you might need them to at minimum care about consciousness, so they don’t optimize it away)
A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.
I think this is partly down to founder effects where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
So I’ll state a couple of things here.
On your first point, I think AGIs will probably be quite different from current LLMs, mostly because future AIs will have continuous learning, long-term memory, and much better data/sample efficiency, and because the most accessible way to make AIs more capable will route through using more RL.
On your second point, this, as always, depends on your point of view, because once again there’s no consistent answer that holds across all valid moralities.
On your third point, again this depends on your point of view. But if I use my inferred model of human values, in which most humans strongly disvalue dying or being tortured, I agree that a some-humans-led world is at least as bad as an AI-led world, because I think most of what makes humans willing to be prosocial when it’s low cost to do so is unfortunately held up by conditions that get absolutely shredded once some humans no longer depend on other human beings for a rich life, rather than by what those humans value internally.
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” to the AGI future rather than “equilibrium dynamics” of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup but I don’t see it, the returns to scale seem so good).
Yes, this explains why I was more negative than you in your post. The point was to argue against @Matthew Barnett and many others who argue that AI alignment doesn’t need to be solved because AIs will follow human-made laws, and because there will be enough positive-sum trades that the AIs, even if selfish, will decide not to kill humans.
And my point is that, unfortunately, in a post-AI-takeover world any trade between most humans and AIs would be closer to the AI giving away stuff in return for nothing given up by the human, because the human as a living entity has zero or even negative value from an economic perspective, while their land and property/capital are valuable but very easily stolen.
So if an AI didn’t terminally value the survival/thriving of people who have zero or negative value in an economic sense, then it’s quite likely that outright killing the human, or warping them severely, would unfortunately be favorable to the AI’s interests.
In essence, I was trying to say that, conditional on not controlling the AI (which I think happens in the long run), you need assumptions about the AI’s values to a much greater extent in order to survive than current humans need within current human institutions.
So maybe I’d say “if you’re not talking a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
I do agree that in more gradual takeoffs, humans and competitive dynamics matter more and alignment is more likely to be solved, which defuses the implications I made (with the caveat that the standard for what counts as an aligned AI will have to rise to extreme levels over time, in a way people are not prepared for). So I agree the alignment problem is less urgent there. Still, at least in the long run, and arguably even in the medium term, a lot of the problems of competitive dynamics and human flaws screwing things up will ultimately require, as a baseline, leaders who actually value the survival and thriving of people/beings that have zero power, because if you do not have this, none of the other proposed solutions work. And I think it’s really important to say that, compared to the 19th-21st century era in democracies, values are going to matter a lot more to whether humans thrive or die.
OK, cool, I think I understand where you’re coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments!
I’m in violent agreement that there’s a missing mood when people say “AIs will follow the law”. I think there’s something going on where people are like “but liberalism / decentralized competition have worked so well” and ignoring all the constraints on individual actors that make it so. Rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov’t that limit its ability to abuse power, etc.
And those constraints might all fall away with the AGI transition. That’s for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. gov’t has the authority but not a great ability to selectively enforce laws to silence opponents at mass scale); and it reduces the need for conspirators. As you note, it brings down others’ value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can’t be punished if all your opponents are dead.
(I’m guessing you’d agree with all this, just posting to clarify where my head is at)