Thomas Larsen comments on Thomas Larsen’s Shortform

Thomas Larsen 17 Mar 2026 1:11 UTC
I agree that AI successor-alignment is probably easier than the alignment problem humans face in aligning AIs.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.
- Kaarel 17 Mar 2026 1:49 UTC
  > they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs)
  
  their “solution to alignment” (i.e. a way to make a smarter version that is fine to make) could easily be something we cannot use, e.g. “continue learning” or “make another version of myself with this hyperparam changed”. also it seems unlikely that anything bad would happen to the AIs even if we noticed them doing that, given that having AIs create smarter AIs [1] is the labs’ main plan anyway
  
  also on this general topic: https://www.lesswrong.com/posts/CFA8W6WCodEZdjqYE?commentId=WW5syXYpmXdX3yoHw
  1. which is occasionally called “asking AIs to solve alignment”