I think that there is a further distinction that might be drawn between “constitutive” and “evidential” interpretations of the reward signal.
On the constitutive interpretation, the reward signal just is the thing being optimized for; call that thing intrinsic value. (This is what I take to be the textbook interpretation in RL.) On the evidential interpretation, the reward signal is evidence of intrinsic value that the agent uses to update its expected-value representations. On this view, the reward signal is more like a perceptual signal, but with intrinsic value as its content.
On the evidential interpretation, standard RL models like TD learning become a special case in which the reward signal is perfectly accurate and known by the agent to be so. But we could build a generalized model in which intrinsic value is a latent variable that the value function estimates, and the reward signal is modelled as a (possibly noisy) observation of that latent variable.
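To make that concrete, here is a minimal sketch of the kind of thing I have in mind (my own toy construction, not a standard algorithm): the agent keeps a Gaussian posterior over the latent intrinsic value of each state, updates it from noisy reward observations with a Kalman-style rule, and runs ordinary TD(0) against the posterior mean rather than the raw reward. All the names and parameters here are illustrative assumptions.

```python
import numpy as np

n_states = 5
gamma = 0.9
alpha = 0.1          # TD step size
obs_noise_var = 1.0  # assumed variance of the reward signal around intrinsic value

# Posterior over the latent intrinsic value of each state: mean and variance.
mu = np.zeros(n_states)
var = np.full(n_states, 10.0)   # broad prior: the agent does not yet trust the signal

V = np.zeros(n_states)          # value-function estimate

def observe_reward(state, r_obs):
    """Update the posterior over intrinsic value from a noisy reward observation."""
    k = var[state] / (var[state] + obs_noise_var)   # Kalman gain
    mu[state] += k * (r_obs - mu[state])
    var[state] *= (1.0 - k)

def td_update(s, s_next):
    """TD(0) backup that uses the posterior mean, not the raw reward, as the target's reward term."""
    target = mu[s] + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

# Toy rollout on a chain: observed rewards are noisy readings of a fixed latent value.
rng = np.random.default_rng(0)
true_intrinsic = np.linspace(0.0, 1.0, n_states)
for _ in range(2000):
    s = rng.integers(n_states)
    s_next = min(s + 1, n_states - 1)
    r_obs = true_intrinsic[s] + rng.normal(0.0, np.sqrt(obs_noise_var))
    observe_reward(s, r_obs)
    td_update(s, s_next)

print("posterior means:", np.round(mu, 2))
print("posterior vars: ", np.round(var, 3))
print("values:         ", np.round(V, 2))
```

Note that if the observation noise is taken to zero, the posterior mean just tracks the raw reward and the update collapses back to standard TD(0), which is the sense in which the textbook picture is a special case.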
Why would any of this matter? I've been thinking that an agent that is uncertain about whether its reward signal accurately tracks the thing it is trying to optimize for would rationally hesitate to pursue any action with excessive conviction. This would create a rational pressure against irreversible actions that give up option value, since with more evidence the agent might discover that what it took to be valuable was not; a toy calculation below illustrates the point.
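Here is a toy numerical illustration (again my own construction, and it ignores discounting and any cost of waiting): suppose the agent assigns probability p to "the reward signal accurately reflects intrinsic value". An irreversible action that looks good under the signal can be dominated by waiting for more evidence, because waiting preserves the option to not act if the signal turns out to be misleading.

```python
p = 0.7             # credence that the reward signal is accurate
gain_if_right = 1.0
loss_if_wrong = -1.0

# Commit irreversibly now: exposed to the downside if the signal was misleading.
ev_act_now = p * gain_if_right + (1 - p) * loss_if_wrong

# Wait for evidence first, then act only if the thing really is valuable.
ev_wait = p * gain_if_right + (1 - p) * 0.0

print(f"EV(act now) = {ev_act_now:.2f}")   # 0.40
print(f"EV(wait)    = {ev_wait:.2f}")      # 0.70
```

The gap between the two expected values is exactly the option value that the irreversible action gives up, and it shrinks to zero as p approaches 1, i.e. as the agent becomes certain the signal is accurate.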
I could see myself being convinced that this distinction doesn't matter much, but currently I think it is an important and neglected one.