Thanks! I really appreciate this, and I think your natural-latents framing fits nicely with the Part I point about needing to compress D down to a small set of crisp, structured latents. On the lexicographic point: it’s worth noting that even though Theorem 3 writes the full objective as a discounted sum, the safety heads U1-U4 aren’t long-horizon objectives — they’re local one-step tests whose optimal action doesn’t depend on future predictions. For example, U1 is automatically satisfied each round by waiting (and once the human approves the proposed action, the agent simply executes it, thereby engaging U2-U5), and U4 is a one-step reversibility check against an inaction baseline, not a long-run impact estimate. The only head with genuine long-horizon structure is U5, which sits below the safety heads, so discounting never creates optimization pressure on them. This makes the whole scheme intentionally deontic and “natural-latent–friendly”, exactly matching the tractable regime suggested by the large-D lower bounds of Part I.
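To make the lexicographic point concrete, here's a rough sketch of the decision rule as I'm picturing it (all names, signatures, and helpers below are invented for illustration and aren't part of the actual formalism): the safety heads act as one-step pass/fail filters on the immediate transition, and only U5 gets to rank whatever survives them.

```python
# Illustrative sketch only: a hypothetical lexicographic action selector in which
# the safety heads U1-U4 are myopic, one-step predicates evaluated on the
# immediate transition, and only U5 involves a discounted long-horizon estimate.
# Every name here (select_action, safety_heads, u5_value, wait_action, ...) is
# made up for this example.

from typing import Any, Callable, Iterable

State = Any
Action = Any

def select_action(
    state: State,
    candidate_actions: Iterable[Action],
    safety_heads: list[Callable[[State, Action], bool]],  # U1-U4: one-step pass/fail tests
    u5_value: Callable[[State, Action], float],           # U5: discounted long-horizon estimate
    wait_action: Action,
) -> Action:
    # Step 1: filter by the safety heads. Each head only sees (state, action),
    # so its verdict cannot depend on future predictions -- this is the sense in
    # which discounting exerts no optimization pressure on U1-U4.
    safe = [a for a in candidate_actions
            if all(head(state, a) for head in safety_heads)]

    # Step 2: if nothing passes every head, fall back to waiting, which (per the
    # discussion above) satisfies U1 by construction each round.
    if not safe:
        return wait_action

    # Step 3: only among safety-passing actions does the long-horizon head U5
    # break ties -- a lexicographic ordering, not a weighted sum.
    return max(safe, key=lambda a: u5_value(state, a))
```

The point of the structure is that the safety heads never trade off against U5: they gate the candidate set before U5 is ever consulted, so no amount of discounted future value can buy a safety violation.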
Yeah, I agree that your formalism achieves what we want. The challenge is getting an actual AI that is appropriately myopic with respect to U1-U4. And of the things about which an AI could obtain certificates of determinacy, its own one-step actions seem the most likely candidates.