The relative value of the bucket contents compared to the goal is captured by the weight on the impact penalty relative to the reward.
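As a minimal sketch of that trade-off (names and numbers here are hypothetical, not from any particular impact-measure implementation), the agent's objective can be written as reward minus a weighted penalty, where the weight encodes how much the side effects are valued relative to the goal:

```python
def combined_objective(reward: float, impact_penalty: float, weight: float) -> float:
    """Task reward minus the weighted impact penalty.

    `weight` plays the role described above: it sets the relative value
    of avoided side effects (the "bucket contents") against the goal.
    """
    return reward - weight * impact_penalty

# With a small weight, the goal dominates the objective...
low = combined_objective(reward=10.0, impact_penalty=4.0, weight=0.5)   # 10 - 2 = 8.0
# ...with a large weight, avoiding the impact dominates.
high = combined_objective(reward=10.0, impact_penalty=4.0, weight=5.0)  # 10 - 20 = -10.0
```

Under this framing, choosing the weight already requires a value judgment about how bad the side effect is relative to the goal, which is the point at issue.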
Yep, I agree :-)
I generally think that impact measures don’t have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches require some, but not all, of human preferences: "some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values is needed.