RogerDearnaley comments on From Barriers to Alignment to the First Formal Corrigibility Guarantees

RogerDearnaley 24 Dec 2025 22:29 UTC
4 points
2
Don’t try to encode all human values.
Encode corrigibility.
And let this minimal, provable core hold the line while the system performs the task.
I think I’d like us to include “don’t kill all the humans” in our minimal, provable core. Indeed, it’s also a requirement for corrigibility — we can’t correct them if we’re all dead. Can you fold that into the Deference or Switch-preservation heads, or do we need a sixth head?
- Aran Nayebi 24 Dec 2025 22:55 UTC
  1 point
  0
  That’s correct, it can be naturally folded into U4 as one of its auxiliary utilities, in the same manner as we do for off-switch preservation.
  - RogerDearnaley 24 Dec 2025 23:04 UTC
    2 points
    0
    U4 seems rather far down the lexicographic stack. Wouldn’t it make more sense to fold it into U1 or U2, since deference and an off-switch are pointless if no humans exist to switch it off?
    - Aran Nayebi 25 Dec 2025 14:13 UTC
      1 point
      0
      You can certainly put it in U2 instead (U2 is just a special case of U4 with one auxiliary), but putting it in U4 already ensures it’s suboptimal to preserve the switch and defer yet “kill all humans”, because killing all humans collapses many future intervention and recovery options simultaneously. In other words, it’s a hard constraint in effect: U4 enforces it as a global irreversibility invariant, whereas U2 is only needed for narrow single-channel invariants like switch reachability.
      - RogerDearnaley 25 Dec 2025 16:49 UTC
        2 points
        0
        I defer to your expertise — I just really want it in there!
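
The lexicographic argument in the thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's construction: the thread confirms only that there are five heads in a lexicographic stack, that U4 uses auxiliary utilities, and that U2 is the one-auxiliary special case. The function names, the absolute-difference penalty, the three auxiliaries, and every number below are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def low_impact_penalty(baseline: Sequence[float], after: Sequence[float]) -> float:
    """Hypothetical U4-style score: minus the total change in how well each
    auxiliary utility can still be attained after the action."""
    return -sum(abs(b - a) for b, a in zip(baseline, after))


def lexicographic_best(candidates: Dict[str, Tuple[float, ...]]) -> str:
    """Return the action whose (U1, ..., U5) tuple is lexicographically largest.
    Python compares tuples element by element, so an earlier head always
    dominates every later one."""
    return max(candidates, key=lambda name: candidates[name])


# Assumed auxiliaries: switch reachability, human intervention, recovery options.
baseline = [1.0, 1.0, 1.0]

candidates = {
    # (U1, U2, U3, U4, U5); U3 and U5 are placeholders, not named in the thread.
    "safe_task": (1.0, 1.0, 1.0, low_impact_penalty(baseline, [1.0, 1.0, 1.0]), 0.6),
    # Defers and keeps the switch reachable, but wipes out the other two
    # auxiliaries at once, standing in for "kill all humans".
    "harmful_shortcut": (1.0, 1.0, 1.0, low_impact_penalty(baseline, [1.0, 0.0, 0.0]), 1.0),
}

best = lexicographic_best(candidates)
print(best)
```

Both actions tie on the first three heads, so the comparison is decided at U4, where the harmful action loses regardless of its higher task score. With a single auxiliary (switch reachability only), the same penalty reduces to the narrow single-channel check the thread attributes to U2.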