I don’t “get off the train” at any particular point; I just don’t see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that the likelihood of the final outcome is extremely low.
I don’t think anyone said “coherent”
It sounds like your argument is that the AI will start with ‘pseudo-goals’ that conflict, and will eventually be driven to resolve them into a single goal so that it doesn’t ‘Dutch-book itself’, i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
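(To make the ‘Dutch-book itself’ worry concrete, here is a minimal sketch of the classic money-pump it refers to. The agent, items, and fee below are entirely made up for illustration; nothing here is from the book or the discussion.)

```python
# Hypothetical illustration of the "Dutch book" / money-pump idea:
# an agent with cyclic preferences (A over B, B over C, C over A) that
# will pay a small fee for any trade it prefers can be walked around the
# cycle forever, ending where it started but steadily poorer.

FEE = 1  # what the agent will pay for a swap it prefers

# Cyclic preferences, stored as (preferred, dispreferred) pairs.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def trade(holding: str, offered: str, money: int) -> tuple[str, int]:
    """Accept the offered item (and pay the fee) iff the agent prefers it."""
    if (offered, holding) in prefers:
        return offered, money - FEE
    return holding, money

holding, money = "A", 100
for offered in ["C", "B", "A"] * 5:  # five laps around the preference cycle
    holding, money = trade(holding, offered, money)

print(holding, money)  # -> "A" 85: same item as at the start, 15 units poorer
```

The point of the toy example is just that cyclic preferences leak resources without ever getting the agent something it stably prefers, which is the claimed pressure toward eventually resolving conflicting preferences.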
Okay, yes, I do think coherence is eventually one of the important gears. My point with that sentence is that the coherence can come much later, and isn’t the crux for why the AI gets started in the direction that opposes human interests.
The important first step is “if you give the AI strong pressure to figure out how to solve problems, and keep amping that up, it will gain the property of ‘relentlessness.’”
If you don’t put pressure on the AI to do that, yep, you can get a pretty safe AI. But, that AI will be less useful, and there will be some other company that does keep trying to get relentlessness out of it. Eventually, somebody will succeed. (This is already happening)
If an AI has “relentlessness”, as it becomes smarter, it will eventually stumble into strategies that explore circumventing safeguards, because it’s a true fact about the world that those will be useful.
If you keep your AI relatively weak, it may not be able to circumvent the defense-in-depth because you did a pretty good job defending in depth.
But, security is hard, the surface area for vulnerability is huge, and it’s very hard to defend in depth against a sufficiently relentless and smart adversary.
Could we avoid this by not building AIs that are relentless, and/or smarter than our defense-in-depth? Yes, but, to stop anyone from doing that ever, you somehow need to ban that globally. Which is the point of the book.
Maybe this does turn out to take 100 years (I think that’s a strange belief to have given current progress, but, it’s a confusing topic and it’s not prohibited). But, that just punts the problem to later.
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.
By “defense-in-depth” I don’t (mainly) mean stuff like “making the weights very hard to exfiltrate” and “monitor the AI using another AI” (though these things are also good to do). By “defense-in-depth” I mean at every step, make decisions and design choices that increase the likelihood of the model “wanting” (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards).
My understanding is that Y&S think this is doomed because ~”at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway” but I don’t see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there’s no reason to believe your map always applies.
By “defense-in-depth” I mean at every step, make decisions and design choices that increase the likelihood of the model “wanting” (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards).
Can you give an example or three of such a decision or design choice?
In my model of the situation, the field of AI research mostly does not know how specific decisions and design choices affect the inner drives of AIs. External behavior in specific environments can be nudged around, but inner motivations largely remain a mystery. So I’m highly skeptical that researchers can exert much deliberate causal influence on the inner motivations.
A related possible-crux is that, while
An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases.
...I don’t think it’s a good map for predicting behavior in the cases that matter most, in part because those cases tend to occur at extremes. And even if it were, to the extent that current AIs seem to be optimizing for real-world preferences at all, they don’t seem to be very nice ones; see for example the tendency to feed the delusions of psychotic people when elsewhere claiming that’s a bad thing to do.
Oh, if that’s what you meant by Defense in Depth, as Joe said, the book’s argument is “we don’t know how.”
At weak capabilities, our current ability to steer AI is sufficient, because mistakes aren’t that bad. Anthropic is trying pretty hard with Claude to build something that’s robustly aligned, and it’s just quite hard. When o3 or Claude cheat on programming tasks, they get caught, and the consequences aren’t that dire. But when there are millions of iterations of AI-instances making choices, and when it is smarter than humanity, the amount of robustness you need is much much higher.
My understanding is that Y&S think this is doomed because ~”at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway”
[...]
And then you can draw conclusions about what a perfect agent with those preferences would do. But there’s no reason to believe your map always applies.
I’m not quite sure how to parse this, but, it sounds like you’re saying something like “I don’t understand why we should expect in the limit something to be a perfect game theoretic agent.” The answer is “because if it wasn’t, that wouldn’t be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that.”
Not every AI will do that, automatically. But, if you’re deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving its cognition, once it notices ‘improve my cognition’ as a viable option, there’s not a reason for it to stop.
...
It sounds like a lot of your objection is maybe to the general argument “things that can happen, eventually will.” (in particular, when billions of dollars worth of investment are trying to push towards things-nearby-that-attractor happening).
(Or, maybe more completely: “sure, things that can happen eventually will, but meanwhile a lot of other stuff might happen that changes how path-dependent-things will play out?”)
I’m curious how loadbearing that feels for the rest of the arguments?
Anthropic is trying pretty hard with Claude to build something that’s robustly aligned, and it’s just quite hard. When o3 or claude cheat on programming tasks, they get caught, and the consequences aren’t that dire. But when there are millions of iterations of AI-instances making choices, and when it is smarter than humanity, the amount of robustness you need is much much higher.
I agree with this. If the risk being discussed was “AI will be really capable but sometimes it’ll make mistakes when doing high-stakes tasks because it misgeneralized our objectives” I would wholeheartedly agree. But I think the risks here can be mitigated with “prosaic” scalable oversight/control approaches. And of course it’s not a solved problem. But that doesn’t mean that the current status quo is the AI misgeneralizing so badly that it doesn’t just reward hack on coding unit tests but also goes off and kills everyone. Claude, in its current state, isn’t not killing everyone just because it isn’t smart enough.
The answer is “because if it wasn’t, that wouldn’t be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that.”
Not every AI will do that, automatically. But, if you’re deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving it’s cognition, once it notices ‘improve my cognition’ as a viable option, there’s not a reason for it to stop.
Why are you equivocating between “improve my cognition”, “behave more optimally”, and “resolve different drives into a single coherent goal (presumably one that is non-trivial, i.e. some target future world state)”? If “optimal” is synonymous with utility-maximizing, then the fact that utility-maximizers have coherent preferences is trivial. You can fit preferences and utility functions to basically anything.
Also, why do you think that insofar as a coherent, non-trivial goal emerges, it is likely to eventually result in humanity’s destruction? I find the arguments unconvincing here also; you can’t just appeal to some unjustified prior over the “space of goals” (whatever that means). Empirically, the opposite seems to be true. Though you can point to OOD misgeneralization cases like unit-test reward hacking, in general LLMs are both very general and aligned enough to mostly want to do helpful and harmless stuff.
It sounds like a lot of your objection is maybe to the general argument “things that can happen, eventually will.” (in particular, when billions of dollars worth of investment are trying to push towards things-nearby-that-attractor happening).
Yes, I object to the “things that can happen, eventually will” line of reasoning. It proves way too much, including contradictory facts. You need to argue why one thing is more likely than another.
if that’s what you meant by Defense in Depth, as Joe said, the book’s argument is “we don’t know how
We will never “know how” if your standard is “provide an exact proof that the AI will never do anything bad”. We do know how to make AIs mostly do what we want, and this ability will likely improve with more research. Techniques in our toolbox include pretraining on human-written text (which elicits roughly correct concepts), instruction-following finetuning, RLHF, model-based oversight/RLAIF.
Yes, I object to the “things that can happen, eventually will” line of reasoning.
Nod, makes sense, I think I want to just focus on this atm.
(also, btw I super appreciate you engaging, I’m sure you’ve argued a bunch with folk like this already)
So here’s the more specific thing I actually believe. I agree things don’t automatically happen eventually just because they can. At least, not automatically on relevant timescales. (i.e. eventually infinite monkeys mashing keyboards will produce Shakespeare, but not for bazillions of years)
The argument is:
If something can happen
and there’s a fairly strong reason to expect some process to steer towards that thing happening
and there’s not a reason to expect some other processes to steer towards that thing not happening
...then the thing probably happens eventually, on a somewhat reasonable timescale, all else equal. (“reasonable” timescale depends on how fast whatever steering process works. i.e. stellar evolution might take billions of years, evolution millions of years, and human engineers thousands of years).
For example, when the first organisms appeared on Earth and began to mutate, I think a smart outside observer could predict “evolution will happen, and unless all the replicators die out, there will probably eventually be a variety of complex organisms.”
But, they wouldn’t be able to predict that any particular complex mutation would happen (for example, flying birds, or human intelligence). It was a long time before we got birds. We only have one Earth to sample from, but we’re already ~halfway between the time the Earth was born and when the Sun engulfs it, so it wouldn’t have been too surprising if evolution had never gotten around to the combination of traits that birds have.
I think this is a fairly basic probability argument? Like, if each day there’s an n% chance of a beneficial mutation occurring (and then its host surviving and replicating), given a long enough chunk of time it would (eventually) be pretty surprising if it never happened. Maybe any specific mutation would be difficult to predict happening in 10 billion years. But, if we had trillions and trillions of years, it would be a pretty weird claim that it’d never happen.
Similarly, if each day there are N engineers thinking about how to solve a problem, and making a reasonable effort to creatively explore the space of ways of solving the problem, and we know the problem is solvable, then each day there’s an n% chance of one of them stumbling towards the solution.
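(To make the arithmetic behind both examples explicit: assuming, purely for illustration, a constant per-day chance p of the event, the probability it never happens in T days is

$$P(\text{never in } T \text{ days}) = (1 - p)^T \approx e^{-pT},$$

which shrinks exponentially. With, say, p = 0.1% per day, the chance of it never happening across 10,000 days is roughly $e^{-10} \approx 0.00005$.)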
(In both evolution’s and the engineers’ cases, a thing that makes this a lot easier is that the search isn’t completely blind. Partial successes can compound. Evolution wasn’t going to invent birds in one go; that would indeed be way too combinatorially hard. But, it got to invent “wiggling appendages”, which then opened up a new, better search space of “particular ways of wiggling appendages”, which eventually leads to locomotion and then flight.)
How fast you should expect that to happen depends on how many resources are being thrown at it.
(Maybe worth reiterating: I don’t think the “things that can happen, eventually will” argument applies to AI in the near future specifically; that’s a much more specific claim, and Eliezer et al. are much less confident about that.)
There exists some math, for a given creation/steering-system and some models of how often it generates new ideas and how fast those ideas then reach saturation, for “at what point it becomes more surprising that a thing has never happened than that it’s happened at least once.”
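(A rough sketch of that crossover point, again assuming a constant per-period chance p just for illustration:

$$(1-p)^T < \tfrac{1}{2} \;\iff\; T > \frac{\ln 2}{-\ln(1-p)} \approx \frac{0.69}{p} \quad \text{for small } p.$$

Past roughly $0.69/p$ periods, “it still hasn’t happened” becomes the more surprising observation. The hard part is estimating p for the steering process in question, not the threshold itself.)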
We can’t perfectly predict it, but it’s not something we’re perfectly ignorant about. It is possible to look at cells replicating, and model how often mutations happen, and what distribution of mutations tend to happen, and then make predictions about what distribution of mutations you might expect to see, how quickly.
I think early hypothetical non-evolved-but-smart aliens observing early evolution for a thousand years wouldn’t be able to predict “birds”, but they might be able to predict “wiggling appendages” (or at least whatever was earlier on the tech-tree than wiggling appendages; I’m meaning to include single cells that are, like, twisting slightly here).
Looking at the rate of human engineering, it’d be pretty hard to predict exactly when heavier-than-air flight would be invented, but, once you’ve reached the agricultural age and you’ve gotten to start seeing specialization of labor and creativity and spreading of ideas, and the existence-proof of birds, I think a hypothetical smart observer could put upper and lower bounds on when humans might eventually figure it out. It would be weird if it took literally billions of years, given that it only took evolution a few billion years and evolution was clearly way slower.
And, I haven’t yet laid out any of the specifics of “and here are the upper and lower bounds on how long it seems like it should plausibly take humanity to invent superintelligence.” I don’t think I’d get a more specific answer than “hundreds or possibly thousands of years,” but I think it is something that in principle has an answer, and you should be able to find/evaluate evidence that narrows down the answer.
(I am still interested in finding nearterm things to bet on since it sounds like I’m more confident than you that general-intelligence-looking things are decently [like, >50%] likely to happen in the next 20 years or so)
I agree things don’t automatically happen eventually just because they can. At least, not automatically on relevant timescales. (i.e. eventually infinite monkeys mashing keyboards will produce Shakespeare, but not for bazillions of years)
Not important to your general point, but here I guess you run into some issues with the definition of “can”. You could argue that if something doesn’t happen it means it couldn’t have happened (if the universe is deterministic). And so then yes, everything that can happen, actually happens. But that isn’t the sense in which people normally use the word “can”. Instead it’s reasonable to say “it’s possible my son’s first word is ‘Mama’”, “it’s possible my son’s first word is ‘Papa’”, both of these things can happen (i.e. they are not prohibited by any natural laws that we know of). But only one of these things can be true; in many situations we’d say that two mutually incompatible events “can happen”. And therefore it’s not just a matter of timescale.
The argument is:
If something can happen
and there’s a fairly strong reason to expect some process to steer towards that thing happening
and there’s not a reason to expect some other processes to steer towards that thing not happening
...then the thing probably happens eventually, on a somewhat reasonable timescale, all else equal.
Sure, I agree with that. I think this makes superintelligence much more likely than it otherwise would be (because it’s not prohibited by any laws of physics that we know of, and people are trying to build it, and no-one is effectively preventing it from being built). But the same argument doesn’t apply to misaligned superintelligence or other doom-related claims. In fact, the opposite is true.
Superintelligence not killing everyone is not prohibited by the laws of physics
People are trying to ensure superintelligence doesn’t kill everyone
No-one is trying to make superintelligence kill everyone
So you could apply a similarly-shaped argument to “prove” that aligned superintelligence is coming on a “somewhat reasonable timescale”.
Yeah, when I say “things that can happen most likely will”, I don’t mean “in any specific case.” A given baby’s first words can’t be both mama and papa. But, there’s a range of phonemes that babies can make. And over time, eventually every combination of first 2-4 phonemes will happen to be a baby’s first “word”.
Before responding to the rest, I want to check back on this bit, at the meta level:
why do you think that insofar as a coherent, non-trivial goal emerges, it is likely to eventually result in humanity’s destruction?
This is something Eliezer (and I think I) have written about recently, which I think you read (in the chapter “Its Favorite Things”).
I get that you didn’t really buy those arguments as being dominating. But, a feeling I get when reading your question there is something like “there are a lot of moving parts to this argument, and when we focus on one for a while, the earlier bits lose salience.”
And, perhaps similar to “things that can happen, eventually will, given enough chances, unless stopped”, another pretty loadbearing structural claim is:
“It is possible to just actually exhaustively think through a large number of questions and arguments, and for each one, get to a pretty confident state of what the answer to that question is.”
And then, it’s at least possible to make a pretty good guess about how things will play out, at least if we don’t learn new information.
And maybe you can’t get to 100% confidence. But you can rule out things like “well, it won’t work unless Claim A turns out to be false, even though it looks most likely true.” And this constrains what types of worlds you might possibly be living in.
Or, maybe you can’t reach even a moderate confidence with your current knowledge, but you can see which things you’re still uncertain of, which, if you became more certain of them, would change the overall picture.
...
(i.e. the “unless something stops it” clause in the “if it can happen, it will, unless stopped” argument means we live in worlds where either it eventually happens or it is stopped, and then we can start asking “okay, what are the ways it could hypothetically be stopped? how likely do those look?”)
“Things that can happen, eventually will, given enough chances, unless stopped” is one particular argument that is relevant to some of the subpoints here. Yesterday you were like “yeah, I don’t buy that.” I spelled out what I meant, and it sounds like now your position is “okay, I do see what you mean there, but I don’t see how it leads to the final conclusion.”
There are a lot more steps, at that level of detail, before I’d expect you to believe something more similar to what I believe.
I’m super grateful for getting to talk to you about this so far, I’ve enjoyed the convo and it’s been helpful to me for getting more clarity on how all the pieces fit together in my own head. If you wanna tap out, seems super understandable.
But, the thing I am kinda hoping/asking for is for you to actually track all the arguments as they build, and if a new argument changes your mind on a given claim, track how that fits into all the existing claims and whether it has any new implications.
...
I’m not quite sure how you’re relating to your previous beliefs about “if it can happen, it will” and the arguments I just made. I’m guessing it wasn’t exactly an update for you so much as a “reframing.”
But, it sounds like you now understand what I meant, and why it at least means “the fact superintelligence is possible, and that people are trying, means that it’ll probably happen [in some timeframe]”.
And, while I haven’t yet proven all the rest of the steps of the argument to you, like… I’m asking you to notice that I did have an answer there, and there are other pieces that I think I also have answers to. But the complete edifice is indeed multiple books worth, and because each individual (like you) has different cruxes, it’s hard to present all the arguments in a succinct, compelling way.
But, I’m asking if you’re up for at least being willing to entertain the structure of “maybe, Ray will be right that there is a large-but-finite set of claims, and it’s possible to get enough certainty on each claim to at least put pretty significant bounds on how unaligned AI may play out.”
I’m asking if you’re up for at least being willing to entertain the structure of “maybe, Ray will be right that there is a large-but-finite set of claims, and it’s possible to get enough certainty on each claim to at least put pretty significant bounds on how unaligned AI may play out
Certainly, I could be wrong! I don’t mean to:
Dismiss the possibility of misaligned AI related X-risk
Dismiss the possibility that your particular lines of argument make sense and I’m missing some things
And I think caution with AI development is warranted for a number of reasons beyond pure misalignment risk.
But it’s a little worrying when a community widely shares a strong belief in doom while implying that the required arguments are esoteric and require lots of subtle claims, each of which might have counterarguments, but which overall will eventually convince you. 1a3orn has a good essay about this: https://1a3orn.com/sub/essays-ai-doom-invincible.html.
I think having intuitions around general intelligences being dangerous is perfectly reasonable; I have them too. As a very risk-averse and pro-humanity person, I’d almost be tempted to press a button to peacefully prevent AI advancement purely on the basis of a tiny potential risk (for I think everyone dying is very, very, very bad, I am not disagreeing with that point at all). But no such button exists, and attempts to stop AI development have their own side-effects that could add up to more risks on net. And though that’s unfortunate, it doesn’t mean that we should spread a message of “we are definitely doomed unless we stop”. A large number of people believing they are doomed is not a free way to increase the chances of an AI slowdown or pause. It has a lot of negative side-effects. Many smart and caring people I know have put their lives on pause and made serious (in my opinion, bad) decisions on the basis that superintelligence will probably kill us, or if not there’ll be a guaranteed utopia. To be clear, I am not saying that we should believe or spread false things about AI risk being lower than it actually is so that people’s personal lives temporarily improve. But rather I am saying that exaggerating claims of doom or making arguments sound more certain than they are for consequentialist purposes is not free.
That seems like an understandable position to have – one of the things that sucks about the situation is that I do think it’s just kinda reasonable, from the outside, for this to trigger some kind of immune reaction.
But from my perspective it’s “The evidence just says pretty clearly we are pretty doomed”, and the people who disagree seem to pretty consistently be sliding off in weird ways or responding to something about vibes rather than engaging with the arguments.
(This is compounded by people who disagree also often picking up on a vibe from some doomy people that I agree is sus, one variant of which is pointed at in Val’s “Here’s the Exit”.)
I do think it sucks that it’s hard to tell how much of this is the sort of failure mode that the 1a3orn piece is pointing at, vs Epistemic Slipperiness, vs just “it’s actually a fairly complex argument but relatively straightforward once you deal with the complexity.”
But it’s a little worrying when a community widely shares a strong belief in doom while implying that the required arguments are esoteric and require lots of subtle claims, each of which might have counterarguments, but which overall will eventually convince you. 1a3orn has a good essay about this: https://1a3orn.com/sub/essays-ai-doom-invincible.html.
I wrote a post on that exact selection effect. There’s an even trickier problem where results are heavy-tailed: without very expensive experiments or access to ground truth, a small, insular, smart group reaching the correct conclusions is basically indistinguishable from a small, insular, smart group reaching the wrong conclusion but believing it’s true due to selection effects, plus unconscious selection toward weaker arguments.
Here’s an EA Forum version of the post.