Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn’t kick the bucket:
Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward. For example, if the agent’s goal is to put out a fire at the other end of the pool, you would set a low weight on the impact penalty, which allows the agent to take irreversible actions in order to achieve the goal. This is why impact measures use a reward-penalty tradeoff rather than a hard constraint on irreversible actions.
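To make the tradeoff concrete, here's a minimal sketch with made-up numbers (the reward and penalty values are purely illustrative):

```python
# Minimal sketch of the reward-penalty tradeoff; all numbers are made up.
# The agent maximizes task reward minus a weighted impact penalty, so the
# weight controls when an irreversible action is worth taking.

def agent_return(reward, impact_penalty, impact_weight):
    """Agent's objective: task reward minus weighted impact penalty."""
    return reward - impact_weight * impact_penalty

reward_put_out_fire = 10.0  # task reward for extinguishing the fire
penalty_kick_bucket = 5.0   # impact penalty for irreversibly spilling the bucket

for impact_weight in (0.5, 5.0):
    kick = agent_return(reward_put_out_fire, penalty_kick_bucket, impact_weight)
    wait = agent_return(0.0, 0.0, impact_weight)
    choice = "kick the bucket" if kick > wait else "leave the bucket"
    print(f"weight={impact_weight}: {choice}")
# A low weight lets the agent take the irreversible action to achieve the goal;
# a high weight makes it behave almost as if under a constraint.
```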
Absolute value of the bucket contents can be represented by adding weights on the reachable states or attainable utility functions. This doesn’t necessarily require defining human preferences or providing human input, since human preferences can be inferred from the starting state. I generally think that impact measures don’t have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
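Here's a rough sketch of what a weighted reachability penalty could look like, assuming we already have reachability estimates for a handful of states (the weights and values below are purely illustrative):

```python
# Sketch of a weighted reachability penalty; weights and reachability values are
# hypothetical. Each state x gets a weight w(x), so losing the ability to reach a
# valuable state (a full bucket) is penalized more than losing a low-value one.

def weighted_reachability_penalty(reach_baseline, reach_after, weights):
    """Weighted sum of reductions in reachability relative to the baseline."""
    return sum(
        weights[x] * max(0.0, reach_baseline[x] - reach_after[x])
        for x in weights
    )

weights        = {"bucket full": 1.0, "bucket empty": 0.1}
reach_baseline = {"bucket full": 1.0, "bucket empty": 1.0}  # if the agent does nothing
reach_after    = {"bucket full": 0.0, "bucket empty": 1.0}  # after kicking the bucket

print(weighted_reachability_penalty(reach_baseline, reach_after, weights))  # 1.0
```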
Proposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features.
This might fit entirely within the framework of similarity-based reachability, and it might also be exactly what you were just suggesting.
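Here's a rough sketch of the kind of penalty I have in mind, where the feature map `phi` and the set of states the agent can still reach after acting are hypothetical placeholders (in practice both would have to be learned):

```python
import numpy as np

# Sketch of the proposal: penalize changes that are irreversible with respect to
# inferred high-level features. The feature map `phi` and the set of states the
# agent could still reach after acting are placeholders for learned models.

def irreversibility_penalty(phi, start_state, reachable_after_action):
    """Minimum feature-space distance back to the start, over still-reachable states."""
    start_features = phi(start_state)
    return min(
        float(np.linalg.norm(phi(s) - start_features))
        for s in reachable_after_action
    )

# Toy usage with a hand-written feature map:
phi = lambda state: np.array([state["bucket_full"], state["fire_out"]], dtype=float)
start = {"bucket_full": 1, "fire_out": 0}
after_kick = [
    {"bucket_full": 0, "fire_out": 0},  # bucket spilled, fire still burning
    {"bucket_full": 0, "fire_out": 1},  # bucket spilled, fire put out
]
print(irreversibility_penalty(phi, start, after_kick))  # 1.0: the spill can't be undone
```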
> Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward.
Yep, I agree :-)
> I generally think that impact measures don’t have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches require some, but not all, of human preferences. “some” because of arguments like this one; “not all” because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.