It looks like we do agree on quite a lot of things. Not a surprise, but glad to see it laid out.
Why are you assuming the AI has misaligned goals?
The short, somewhat trite answer is that it’s baked into the premise. If we had a way of getting a powerful optimizer that didn’t have misaligned goals, we wouldn’t need said optimizer to solve alignment!
The more precise answer is that we can train for competence but not goodness, that current LLMs have misaligned goals to the extent they have any at all, and that this doesn’t seem likely to change.
Yup. (Well, I’ll try; the whole conversation on that particular crux seems unusually muddled to me and it shows.)
I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn’t intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts in these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).
The cruxy points here are, I think, “good enough” and “sufficiently”, and the underlying implication that partial progress on alignment can make capabilities much safer. I doubt this. A future post will touch on why.
To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation.
Nitpick: Neither approach seems to require AI labor. I certainly use plenty of LLMs in my workflow, but maybe you’d have something more ambitious in mind.
More on several of these topics in the coming posts.
The short, somewhat trite answer is that it’s baked into the premise. If we had a way of getting a powerful optimizer that didn’t have misaligned goals, we wouldn’t need said optimizer to solve alignment!
Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn’t trying to refer to systems which are misaligned but controlled; I was trying to refer to aligned systems.
Perhaps you think arguments for alignment difficulty imply that, with basically prosaic methods, it would be extremely difficult to avoid AIs scheming against us, even if the AIs are comparable in capability to top human experts. I don’t really see why this is the case.
Nitpick: Neither approach seems to require AI labor.
Sure, I was just supporting the claim that “less capable AI systems can make meaningful progress on improving the situation”. You seemed to be implicitly arguing against this claim.
Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn’t trying to refer to systems which are misaligned but controlled; I was trying to refer to aligned systems.
...huh. It seems to me that the fundamental problem in machine learning, namely that no one has a way to engineer specific goals into an AI, applies to weak AIs just as much as to powerful ones. So this might be a key crux.
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs who steer robustly and entirely towards human values, and would do so even on reflection; that we’d actually know how to do this reliably on purpose with practical engineering, but that such knowledge wouldn’t generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly requires understanding cognition on such a deep and fundamental level that it mostly would generalize to superintelligence, though of course it’d still be foolish to rely on this generalization alone.)
If instead you meant something more like what you described here, systems that are not “egregiously misaligned”, then that’s a different matter. But I get the sense it actually is the first thing, in this specific narrow case?
Sure, I was just supporting the claim that “less capable AI systems can make meaningful progress on improving the situation”. You seemed to be implicitly arguing against this claim.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence. That’s the main intended thrust of this particular post. (Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either; I’d perhaps be in favor of narrow AIs for this purpose if such a carve-out could be specified in a treaty without leaving enormous exploitable loopholes. Maybe it can.)
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV?
No, I mean “make AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”, which IMO requires something much weaker, though for sufficiently powerful AIs, I do think it requires a mundane version of reflective stability. This would involve some version of corrigibility. Something like “avoid egregious misalignment / scheming” + “ensure the AI actually is robustly trying to pursue our interests on hard-to-check and open-ended tasks”.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.
Again, this might come down to how you are defining alignment. I think such systems can make progress on “for AIs somewhat more capable than top human experts, make these AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”.
Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either
I don’t generally think “should labs continue” is very cruxy from my perspective, and I don’t think of myself as trying to argue about this. I’m trying to argue that marginal effort directed towards the broad hope I’m painting substantially reduces risk.