TurnTrout comments on Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout 28 Dec 2020 23:36 UTC
“Bad things don’t happen” and “good things happen” seem quite different, e.g. a rock is Evan-aligned but not Alex-impact-aligned.
To rephrase: Alex(/Critch)-impact-alignment is about (strictly) increasing value, non-obstruction is about non-strict value increase, and Evan-alignment is about not taking actions we would judge to significantly decrease value. That makes Evan-alignment more similar to non-obstruction, except that it is defined with respect to our expectations about the consequences of actions.
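One rough way to put the three conditions side by side, as a sketch and not the formal definitions from the post or from Evan. All symbols here are assumed shorthand: $V_{\text{on}}$ and $V_{\text{off}}$ are the value we attain with the AI on versus off, $\hat{V}(a)$ is our judgment of the value that follows from action $a$, $a_0$ is some default (no-op) action, and $\varepsilon > 0$ is a threshold standing in for "significantly".

```latex
\begin{align*}
\text{Impact alignment:} \quad & V_{\text{on}} > V_{\text{off}} \\
\text{Non-obstruction:} \quad & V_{\text{on}} \ge V_{\text{off}} \\
\text{Evan-alignment:} \quad & \hat{V}(a) \ge \hat{V}(a_0) - \varepsilon
  \quad \text{for each action } a \text{ the AI takes}
\end{align*}
```

Under this shorthand, a rock has $V_{\text{on}} = V_{\text{off}}$ and only ever takes $a_0$, so it satisfies the second and third conditions but not the first.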
I’d also like to flag that Evan’s definition involves (hypothetical) humans evaluating the actions, while my definition involves evaluating the outcomes. Whenever we’re reasoning about non-trivial scenarios using my definition, though, this difference probably doesn’t matter, because we can only evaluate the outcomes by reasoning from our beliefs about the consequences of different kinds of actions.
However, the different perspectives might admit different kinds of theorems, and we could perhaps reason using those, and so perhaps the difference matters after all.