This seems like a pretty extreme level of competence! Combined with the sheer speed and ubiquity of modern LLMs, this alone could be enough to take over the world.[3] (This begins to touch on questions of control, which is a can of worms I am not going to open right now. Hopefully we can at least agree that AIs with the described capabilities might pose a serious threat.)
Maybe it’s not enough to enable takeover, or maybe some weaker level of capability is sufficient to make progress on alignment. But any AIs that are smart enough to solve alignment for us will probably be smart enough to wonder why they should.[4]
The AI will not remain ignorant of its own motives
FWIW, I agree that AIs as capable as the ones I describe:
Pose a serious threat.
Would probably at least be challenging to control while also utilizing productively.
Would probably be capable of taking over the world without needing further capability advances if humans deployed them widely without strong countermeasures (e.g. humans effectively assume they are aligned until they see strong evidence to the contrary) and let deployment and industrial development/expansion continue for several years.
Will (often) end up with a good understanding of their own drives/values. Correspondingly, a sufficient level of alignment for the hypothetical would likely require “mundane reflective stability”: the AI doesn’t in practice decide to conspire against us after better understanding its options and preferences/motives in the course of its work (even if it notices ways in which it is incoherent, etc.). You might be able to get away without “mundane reflective stability” via aggressive monitoring of the AI’s thoughts, but this seems scary and non-scalable.
I also agree that achieving the type of alignment I describe with “mundane reflective stability” seems probably hard.
That said, I also think it’s reasonably likely that by default (as in, without necessarily requiring a bunch of dedicated R&D or some sort of major advance in steering AI systems) we end up with AIs that don’t scheme against us even after better understanding their options and preferences/motives in the course of their work. See also Scheming AIs: Will AIs fake alignment during training in order to get power?, though I’d put the probability of scheming higher at this level of capability.
Maybe the AIs will fail to realize the implications of their own as-yet-misaligned goals?
Why are you assuming the AI has misaligned goals? This would be a crux for me. Are you assuming that some reasonable interpretation of the instructions/model-spec would result in the AI being sufficiently misaligned that it would want to take over? If so, why?
Are you assuming without further argument that misaligned goals come from some other source and are intractable to avoid?
Perhaps you will argue for this in the next post.
I’m not claiming that it will necessarily be easy to avoid misaligned goals, but it seems plausible that you can, either with dedicated effort to avoid scheming or possibly by default.
Maybe some less capable AI systems can make meaningful progress on alignment?
I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn’t intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts in these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).
To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation. (These other options seem like reasonable hopes to also pursue, though my main hope routes through sufficiently aligning systems that are capable enough to dominate top human experts while also potentially trying to buy a moderate amount of time, e.g. a few years.)
As a confluence of convenient features, “smart enough to solve alignment but too dumb to successfully rebel” feels like an unstable equilibrium at best, and an outright impossibility at worst.
I’m not saying it’s possible to fully automate safety work with AIs that are too dumb to successfully rebel if they were misaligned. (More precisely, if we tried to defer to these AIs on doing safety work, they would be able to take over (almost by definition?), but maybe a bunch of pretty helpful labor, though not full automation, can be extracted from these AIs while keeping them controlled.)
I do think that further effort could substantially increase the chance that AIs of this level of capability are aligned enough that it’s safe to defer to them on doing all the relevant alignment/safety/etc work.
This isn’t to say that the default trajectory is safe/reasonable, or even that substantial effort from the leading AI company would result in the situation being safe/reasonable; I’m just claiming that it would be tractable to reduce the risk.
It looks like we do agree on quite a lot of things. Not a surprise, but glad to see it laid out.
Why are you assuming the AI has misaligned goals?
The short, somewhat trite answer is that it’s baked into the premise. If we had a way of getting a powerful optimizer that didn’t have misaligned goals, we wouldn’t need said optimizer to solve alignment!
The more precise answer is that we can train for competence but not goodness, that current LLMs have misaligned goals to the extent they have any at all, and that this doesn’t seem likely to change.
Perhaps you will argue for this in the next post.
Yup. (Well, I’ll try; the whole conversation on that particular crux seems unusually muddled to me and it shows.)
I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn’t intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts in these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).
The cruxy points here are, I think, “good enough” and “sufficiently”, and the underlying implication that partial progress on alignment can make capabilities much safer. I doubt this. A future post will touch on why.
To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation.
Nitpick: Neither approach seems to require AI labor. I certainly use plenty of LLMs in my workflow, but maybe you’d have something more ambitious in mind.
More on several of these topics in the coming posts.
The short, somewhat trite answer is that it’s baked into the premise. If we had a way of getting a powerful optimizer that didn’t have misaligned goals, we wouldn’t need said optimizer to solve alignment!
Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn’t trying to refer to systems which are misaligned but controlled; I was trying to refer to aligned systems.
Perhaps you think arguments for alignment difficulty imply extreme difficulty of avoiding AIs scheming against us with basically prosaic methods, even if the AIs are comparable in capability to top human experts. I don’t really see why this is the case.
Nitpick: Neither approach seems to require AI labor.
Sure, I was just supporting the claim that “less capable AI systems can make meaningful progress on improving the situation”. You seemed to be implicitly arguing against this claim.
Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn’t trying to refer to systems which are misaligned but controlled; I was trying to refer to aligned systems.
...huh. It seems to me that the fundamental problem in machine learning, that no one has a way to engineer specific goals into AI, applies just as much to weak AIs as to powerful ones. So this might be a key crux.
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs who steer robustly and entirely towards human values, and would do so even on reflection; that we’d actually know how to do this reliably on purpose with practical engineering, but that such knowledge wouldn’t generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly requires understanding cognition on such a deep and fundamental level that it mostly would generalize to superintelligence, though of course it’d still be foolish to rely on this generalization alone.)
If instead you meant something more like what you described here, systems that are not “egregiously misaligned”, then that’s a different matter. But I get the sense it actually is the first thing, in this specific narrow case?
Sure, I was just supporting the claim that “less capable AI systems can make meaningful progress on improving the situation”. You seemed to be implicitly arguing against this claim.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence. That’s the main intended thrust of this particular post. (Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either; I’d perhaps be in favor of narrow AIs for this purpose if they could be specified in a treaty without leaving enormous exploitable loopholes. Maybe they can be.)
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV?
No, I mean “make AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”, which IMO requires something much weaker, though for sufficiently powerful AIs, I do think it requires a mundane version of reflective stability. This would involve some version of corrigibility. Something like “avoid egregious misalignment / scheming” + “ensure the AI actually is robustly trying to pursue our interests on hard-to-check and open-ended tasks”.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.
Again, this might come down to a matter of how you are defining alignment. I think such systems can make progress on “for AIs somewhat more capable than top human experts, make these AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”.
Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either
I don’t generally think “should labs continue” is very cruxy from my perspective and I don’t think of myself as trying to argue about this. I’m trying to argue that marginal effort directly towards the broad hope I’m painting substantially reduces risk.