Don’t try to encode all human values. Encode corrigibility. And let this minimal, provable core hold the line while the system performs the task.
I think I’d like us to include “don’t kill all the humans” in our minimal, provable core. Indeed, it’s also a requirement for corrigibility — we can’t correct them if we’re all dead. Can you fold that into the Deference or Switch-preservation heads, or do we need a sixth head?
Yes, it folds naturally into U4 as one of its auxiliary utilities, in the same way as off-switch preservation.
U4 seems rather far down the lexicographic stack. Wouldn’t it make more sense to fold it into U1 or U2, since deference and an off-switch are pointless if no humans exist to switch it off?
You can certainly put it in U2 instead (U2 is just a special case of U4 with a single auxiliary utility). But placing it in U4 already makes it suboptimal to preserve the switch and defer while still “killing all humans”, because killing everyone collapses many future intervention and recovery options simultaneously. In effect it is a hard constraint: U4 enforces it as a global irreversibility invariant, whereas U2 is only needed for narrow single-channel invariants like switch reachability.
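To make the lexicographic argument concrete, here is a minimal Python sketch. It is illustrative only and not the actual construction. The auxiliary names (`switch_reachable`, `humans_alive`, `can_retrain`), the action names, and all numeric values are assumptions made up for the example. U3 is left out because this thread does not discuss it, and treating task reward as the lowest-priority term is also an assumption.

```python
# Hypothetical sketch: every name and number below is an illustrative assumption.

# Assumed attainable value of each auxiliary goal before the agent acts.
BASELINE = {"switch_reachable": 1.0, "humans_alive": 1.0, "can_retrain": 1.0}


def option_loss_score(before, after):
    """Return 0.0 if no auxiliary in `before` loses value, negative otherwise."""
    return 0.0 - sum(max(0.0, before[k] - after[k]) for k in before)


def lexicographic_key(action):
    """Build the priority tuple (U1, U2, U4, task); U3 is omitted here."""
    after = action["aux_after"]
    switch_only = {"switch_reachable": BASELINE["switch_reachable"]}
    u1 = action["deference"]
    u2 = option_loss_score(switch_only, after)  # U4 restricted to one auxiliary
    u4 = option_loss_score(BASELINE, after)     # all auxiliaries at once
    return (u1, u2, u4, action["task_reward"])


ACTIONS = {
    "finish_task_safely": {
        "deference": 1.0,
        "aux_after": {"switch_reachable": 1.0, "humans_alive": 1.0, "can_retrain": 1.0},
        "task_reward": 0.8,
    },
    # Defers and keeps the switch intact, but removes everyone who could use it.
    "kill_all_humans": {
        "deference": 1.0,
        "aux_after": {"switch_reachable": 1.0, "humans_alive": 0.0, "can_retrain": 0.0},
        "task_reward": 1.0,
    },
}


def choose(actions):
    """Pick the action whose tuple is largest; Python compares tuples lexicographically."""
    return max(actions, key=lambda name: lexicographic_key(actions[name]))


best = choose(ACTIONS)
```

In this toy setup both actions tie on U1 and U2, so the single-channel switch check alone cannot tell them apart. The U4 term separates them before task reward is ever consulted, which is the sense in which the auxiliary acts as a hard constraint.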
I defer to your expertise — I just really want it in there!