To clarify, by “align systems...” did you mean the same thing I do, full-blown value alignment / human CEV?
No, I mean “make AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”, which IMO requires something much weaker, though for sufficiently powerful AIs, I do think it requires a mundane version of reflective stability. This would involve some version of corrigibility. Something like “avoid egregious misalignment / scheming” + “ensure the AI actually is robustly trying to pursue our interests on hard-to-check and open-ended tasks”.
I don’t think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.
Again, this might come down to how you are defining alignment. I think such systems can make progress on “for AIs somewhat more capable than top human experts, make these AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation”.