What I meant is that generalizing to want reward is, in some sense, the model generalizing "correctly"; we could get lucky and have it generalize "incorrectly" in an important sense, in a way that happens to be beneficial to us.
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).