Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but I think it might be much easier to align systems which are only just capable enough to hand off the relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn't trying to refer to systems which are misaligned but controlled; I was trying to refer to aligned systems.
...huh. It seems to me that the fundamental problem in machine learning, namely that no one has a way to engineer specific goals into AI, applies just as much to weak AIs as to powerful ones. So this might be a key crux.
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs that steer robustly and entirely towards human values, and would do so even on reflection; that we’d actually know how to do this reliably, on purpose, with practical engineering, but that such knowledge wouldn’t generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly would require understanding cognition at such a deep and fundamental level that the knowledge would mostly generalize to superintelligence, though of course it’d still be foolish to rely on that generalization alone.)
If instead you meant something more like what you described here, systems that are not “egregiously misaligned”, then that’s a different matter. But I get the sense that it actually is the first thing, in this specific narrow case?
Sure, I was just supporting the claim that “less capable AI systems can make meaningful progress on improving the situation”. You seemed to be implicitly arguing against this claim.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence. That’s the main intended thrust of this particular post. (Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so that the humans can solve alignment, is worth letting labs continue either; I’d perhaps be in favor of narrow AIs for this purpose if such systems could be specified in a treaty without leaving enormous exploitable loopholes. Maybe they can.)
To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV?
No, I mean “make AIs robustly pursue the intended aims in practice when we defer to them on doing safety research and managing the situation”, which IMO requires something much weaker, though for sufficiently powerful AIs I do think it requires a mundane version of reflective stability. This would involve some version of corrigibility: something like “avoid egregious misalignment / scheming” + “ensure the AI actually is robustly trying to pursue our interests on hard-to-check and open-ended tasks”.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.
Again, this might come down to a matter of how you are defining alignment. I think such systems can make progress on “for AIs somewhat more capable than top human experts, make these AIs robustly pursue the intended aims in practice when we defer to them on doing safety research and managing the situation”.
Separately, I don’t think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either
I don’t generally think “should labs continue” is very cruxy from my perspective, and I don’t think of myself as trying to argue about this. I’m trying to argue that marginal effort put directly towards the broad hope I’m painting substantially reduces risk.