What I meant is that generalizing to want reward is, in some sense, the model generalizing "correctly"; we could get lucky and have it generalize "incorrectly" in an important sense, in a way that happens to be beneficial to us.
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).